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ABSTRACT 

The  iterative  ellipsoidal  trimming  algorithm  is  in¬ 
troduced  as  both  a  clustering  method  and  an  estimator  of 
location  and  shape.  Its  power  as  a  data  analytic  tool  is 
investigated  and  the  asymptotic  distribution  of  its 
stationary  point  is  derived.  In  addition,  several  scale 
estimators  are  proposed  and  studied. 
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Chapter  1 
Introduction 


It  is  not  uncommon  in  scientific  or  technological 
work  for  an  investigator  to  be  confronted  with  a  collection 
of  entities  and  for  him  then  to  wonder  whether  that  collection 
is,  in  some  sense,  homogeneous  or  whether,  instead,  it  is 
made  up  of  several  distinct  subgroups.  That  branch  of 
statistics  known  as  cluster  analysis  is  concerned  with 
providing  a  body  of  techniques  which  will  be  generally  use¬ 
ful  in  discovering  subpopulations.  it  j.s  to  this  subject 
that  this  dissertation  seeks  to  make  a  contribution. 

The  central  aim  of  this  work  is  to  introduce  what  we 

shall  refer  to  as  the  iterative  ellipsoidal  trimming  algorithm 

(for  brevity's  sake,  IET)  as  a  method  for  discovering  clusters 

and  to  study  some  of  its  properties.  We  define  IET  as  follows 

£ 

Suppose  that  we  have  n  observations  in  R  :  X^,...,X  .  To 
start  the  algorithm,  initial  estimates  (starting  values) 
of  the  mean  and  covariance  of  the  cluster  being  sought  must 
be  provided:  (u Q,  Zg)  .  Sometimes,  when  we  are  in  complete 
ignorance  of  the  distribution  of  the  X's,  it  is  appropriate 
to  let  Uq  =  X;  in  other  situations  jig  may  be  derived 
from  previous  analysis  or  it  may  be  an  arbitrary  point  in  a 
certain  region  of  R  .  Usually,  Zg  is  taken  to  be  1^, 


the  k  x  k  identity  matrix.  To  perform  an  iteration,  one 
specifies  a  p,  which  represents  the  proportion  of  the 
observations  to  be  included  in  the  computation  of  the  next 
estimates,  (y,  X)  ,  and  calculates  the  Mahalanobis  distance 
of  each  Xi  from  :  D?  =  (X^.  -  u0)7~1(Xi  -  MqK  Then, 


y  =  [np] 


E  X. 


X  =  [np]_J-  E  (X.  -  y)  (X.  -  y)  ' 
ieL  1  1 


2  2  2  tu 

where  L  =  {i  :  Df  <  D([np])^»  D(r)  is  the  r  order 

2 

statistic  of  the  Di's,  and  [t]  is  the  greatest  integer  <_  t. 

Of  course,  one  performs  the  next  iteration  by  again  choosing 
a  p  and  then  treating  (u,  /)  as  the  new  (yQ,  Xq) •  We 
will  say  that  IET  has  converged  (for  fixed  p)  if  on  two  suc¬ 
cessive  iterations  we  find  that  L,  the  set  of  indices,  does  not 
change.  Equivalently,  we  will  say  that  IET  has  converged 
if  (y,  X)  stays  the  same  on  two  successive  iterations. 

It  is  appealing  to  call  this  final  estimate  a  stationary 
point  (of  the  sample) .  Sometimes,  it  is  too  time  consuming 
to  wait  for  IET  to  converge;  then  it  is  reasonable  to  simply 
continue  until  successive  changes  in  the  estimates  (u,  X) 
are  sufficiently  small. 


In  successive  iterations  of  the  algorithm,  p  may  be 
allowed  to  change  or  it  may  be  kept  constant;  often  this 
decision  may  fruitfully  be  made  interactively,  that  is  to 
say,  after  looking  at  the  last  (y,  X) •  The  choice  of  a 
sequence  of  p’s  will  depend  on  the  goal  of  the  analysis. 

We  will  have  two  separate  but  closely  related  intentions 
in  mind.  First,  we  wish  to  find  large  clusters  and  second, 
having  found  a  large  cluster,  we  wish  to  obtain  robust 
estimates  of  its  mean  and  covariance,  (y,  X),  with  the  idea 
of  using  them  in  the  search  for  smaller  clusters  in  the 
tails  of  the  large  one.  We  hope  to  discuss  the  problem 
of  finding  such  "hidden"  clusters  in  a  subsequent  paper. 

It  is  pertinent  to  remark  now,  however,  that  we  are 
especially  interested  in  IET  because  it  appears  to  yield 
a  plausible  estimator  when  there  is  asymmetric  contamination 
(for  instance,  a  small  cluster)  in  the  tails  of  the  sample. 

In  chapter  2  we  deal  with  the  problem  of  using  IET 
to  discover  clusters.  There  we  also  examine  a  certain 
lack  of  robustness  characteristic  of  "k-means"  algorithms, 
and  how  IET  avoids  this  difficulty.  Chapter  3  contains 
a  discussion  of  the  asymptotic  properties  of  IET,  always 
supposing  that  there  are  no  outliers  (isolated  points  far 
away  from  the  bulk  of  the  observations)  or  small  hidden 
clusters.  Instead,  the  data  will  be  assumed  to  be  purely 


from  some  spherically  symmetric  or  ellipsoidal  distribution. 
The  fourth  chapter  presents  Monte  Carlo  results  concerning 
the  performance  of  IET  in  the  presence  of  outliers  and 
small  clusters. 

Finally,  in  chapter  5  we  take  up  the  question  of 

scale  estimation.  When  data  that  are  far  from  (y,  %)  are 

trimmed,  inevitably  the  next  estimate  of  %  is  biased 

towards  0;  basically,  %  is  an  estimate  of  c%  for  some  c 

such  that  0  <  c  <  1.  Now  this  is  a  matter  of  no  consequence 

when  we  are  only  interested  in  finding  the  location  and 

shape  of  a  cluster.  Furthermore,  IET  is  unaffected  when 

t  is  multiplied  by  a  constant,  since  the  ordering  of  the 

Mahalanobis  distances  is  unchanged.  However,  when  we  want 

2 

to  compare  the  D  of  an  X  to  two  different  clusters, 
it  is  imperative  that  we  "scale  up"  our  estimates  of 
their  X's.  We  must,  in  essence,  estimate  and  divide  out 
the  "c"  referred  to  above. 

Iterative  ellipsoidal  trimming  has  been  investigated 
before  by  other  statisticians,  most  notably  by  Gnanadesikan 
and  his  coworkers  Kettenring  and  Devlin  [5,  7,  S] .  The 
focus  in  the  past,  however,  has  been  on  Monte  Carlo  studies 
of  the  utility  of  this  algorithm  in  the  robust  estimation  of 
covariance  matrices.  It  has  been  shown  to  be  reasonably 
reliable  for  this  purpose  [5] . 
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Chapter  2 
Finding  Clusters 

In  this  chapter  we  shall  discuss  and  illustrate  the 
use  of  IET  in  discovering  clusters  in  a  data  set,  and 
carry  out  a  comparison  with  a  natural  competitor,  the 
k-means  algorithm.  It  is  expected  that  IET  will  be  a  use¬ 
ful  tool  when  one  wishes  to  find  ellipsoidal  clusters  in 
a  high  dimensional  Euclidean  space  (where  by  high  we  mean 
"greater  than  two") .  The  basic  rationale  behind  its  use 
is  that  it  will  tend  to  climb  up  density  gradients,  stopping 
when  it  reaches  an  ellipsoidal  region  containing  [np]  observa¬ 
tions  whose  sample  mean  is  just  its  center,  and  whose  sample 
covariance  will  generate  its  "shape". 

There  are  several  k-means  algorithms,  the  different 
versions  differing  as  to  whether  or  not  a  covariance  matrix 
is  estimated  for  each  cluster  and  as  to  how  many  observations 
are  reclassified  before  the  k  means  (and  possibly,  covariances) 
are  updated.  We  will  consider  two  versions  of  this  general 
method;  both  of  them  update  the  means  and  covariances 
after  reclassifying  all  of  the  data.  The  following  algorithm 
will  be  referred  to  as  k-means  2.  Suppose  x, ,...,X  are 
our  observations  and  (u^ , ),..., (y^ , X^)  are  our  current 
estimates  of  the  means  and  covariances  of  k  clusters.  Let 


Dij  =  ^Xi  “  ^ j ) ' a j  (X,  -  Uj)  and  classify  into 

jnth  cluster  if 


2  2 
D  ■  =  min  DT  . 

130  1< j  <k  13 


Let  Lj  =  {i  :  is  classified  in  the  j  cluster}. 

Then,  the  next  estimates  are 


u!.1^  |L,  I'1  E  X. 
J  J  ieL. 

3 


i(1)=  iL.r1 

J  1  3  ' 


l  (X.  -  y")  (X.  -  y(1))  '  , 
-  T  ■*"  J  x  J 


(where  |sj,  for  a  set  S,  is  the  number  of  elements  S  contains) 
or,  to  put  it  another  way,  the  sample  means  and  covariances 
of  the  new  clusters.  If  we  fix  Zj  =  I,  the  identity  of  the 
appropriate  dimension,  for  all  j  and  update  only  the  means 
after  reclassifying  by  Euclidean  distance,  then  the  above 
algorithm  reduces  to  what  we  shall  call  k-means  1.  For 
further  discussion  of  k-means  algorithms,  one  might  refer 
to  either  Hartigan's  book  on  clustering  algorithms  [9]  or 
the  volume  by  Duda  and  Hart  on  pattern  recognition  [6]. 

Chernoff  appears  to  have  been  the  first  to  suggest  that 
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it  would  be  helpful  to  estimate  the  covariances  of  the 
different  clusters  [3] ,  though  Rohlf  [11]  made  a  similar 
recommendation  in  the  context  of  hierarchical  clustering. 

The  basic  idea  which  underlies  bouh  IET  and  k-means 
is  that  if  one  has  an  approximation  to  the  mean  of  a  cluster 
and  one  computes  a  new  estimate  of  the  mean  based  only  on 
points  currently  thought  to  belong  to  the  cluster  (based  on 
the  approximation) ,  then  the  new  estimate  will  be  better 
than  the  old  one.  In  one  dimension,  in  the  case  of  IET, 
one  imagines  a  situation  like  that  in  Figure  2.1.  There, 
the  new  estimate  is  the  mean  of  all  the  observations  con¬ 
tained  between  the  brackets  and  will  clearly  be  closer  to 
the  true  mean  than  was  the  starting  value . 

Now  there  are  three  important  ways  in  which  k-means 

and  IET  differ.  IET  defines  the  current  cluster  in  a 

rather  conservative  fashion,  namely  as  those  points  within 
2  ~ 

some  D  of  the  current  (u,  Z)  .  On  the  other  hand, 

fch 

k-means  defines  the  j  cluster  as  everything  that  is  closer 
to  (u^,  £  j )  than  to  any  of  the  other  cluster  centers. 
Hence,  a  k-means  cluster  may  have  unbounded  volume.  The 
second  basic  difference  is  that  when  using  IET,  the 
statistician  specifies  the  proportion  of  data  points  to 
be  included  within  the  ellipsoidal  "window"  whose  location, 
shape,  and  orientation  the  algorithm  computes.  Of  course, 


k-means  lets  the  proportions  vary  according  to  the  results 
of  its  classification  scheme.  Finally,  and  by  definition, 
IET  only  searches  for  one  cluster  at  a  time,  while  k-means 
tries  to  locate  k  clusters  simultaneously. 

Several  important  consequences  follow  from  these 
remarks.  Since  a  k-means  cluster  can  be  unbounded,  even 
if  the  starting  value  for  a  given  cluster  is  very  good, 
one  iteration  can  actually  carry  the  cluster  center  far 
away  from  the  true  center,  if  some  outlying  observation 
happens  to  be  newly  classified  into  that  cluster.  An 
outlier  can,  in  essence,  take  hold  of  a  perfectly  good 
cluster  and,  in  some  cases,  ultimately  have  it  all  to 
itself.  In  the  case  of  k-means  2,  a  more  subtle  and,  in 
some  respects,  a  more  disturbing  occurrence  is  possible. 

If  an  outlier  or  an  observation  from  another  population 
is  misclassif ied  into  a  cluster,  it  will  tend  to  increase 
the  estimate  of  its  scale,  perhaps  by  a  considerable  amount. 
Since  Z  will  then  be  quite  "large",  Mahalanobis  distances 
to  that  cluster  will  tend  to  be  small;  as  a  result,  this 
cluster  will  tend  to  absorb  all  of  the  other  clusters. 

This  latter  point  brings  to  mind  our  earlier  comment  that 
k-means  does  not  allow  the  statistician  to  control  the 
number  of  data  points  in  a  cluster. 
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Since  IET  only  looks  at  observations  within  a  bounded 
ellipsoidal  region  that  is  forced  to  contain  a  certain  number 
of  data  points,  it  is  relatively  immune  to  these  sorts 
of  robustness  problems.  It  may  be  objected  that  to  ask 
the  statistician  to  specify  p  (essentially,  the  size  of 
the  region)  is  to  ask  too  much,  as  he  may  have  no  prior 
information  about  the  distribution  of  the  sample.  But  it 
is  our  view  that  IET  is  an  exploratory  tool  (in  the  sense 
that  Tukey  uses  this  expression  [12])  and  it  must  be  used 
in  an  exploratory  spirit.  One  specifies  a  sequence  of  p's 
and  sees  what  the  algorithm  does,  that  is,  where  the  sequence 
of  estimates,  (y,  E)  ,  goes;  one  chooses  a  new  starting  value 
and  a  new  sequence  of  p's  and  observes  again.  If  there  is 
an  ellipsoidal  cluster  to  be  found  and  p  is  taken  to  be 
sufficiently  small  (no  bigger  than  the  proportion  of  points 
in  the  cluster) ,  then  our  experience  suggests  that  IET  is 
likely  to  find  it.  The  basic  rule  of  thumb  is  that  if 
several  different  starting  points  and  different  sequences 
of  p's  lead  to  convergence  to  approximately  the  same  place, 
then  one  should  suspect  that  a  cluster  has  been  found.  One 
should  then  try  increasing  p,  using  the  found  cluster  as  a 
starting  point,  to  get  a  feel  for  how  big  the  cluster  might 
be  -  if  one  increases  p  from  0.4  to  0.6  without  changing 
the  estimate  of  location,  y,  very  much,  then  this  finding 
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both  reassures  us  that  the  cluster  is  real  and  provides  a 
bound  on  its  size.  Of  course  it  will  then  be  important 
to  make  plots  and  perhaps  use  other  devices  to  make  sure 
that  the  sample  really  does  have  a  mode  where  we  think  it 
does.  One  situation  to  beware  of  can  arise  when  p  is 
taken  to  be  bigger  than  any  cluster  in  the  sample  and  is 
best  described  by  the  diagram  in  Figure  2.2.  Here,  each  of 
the  two  clusters,  one  might  imagine,  contains  50%  of  the 
observations  and  p  was  set  at  70%.  Of  course,  given  that 
one  was  seeking  a  cluster  with  70%  of  the  data,  this  result 
(the  distribution  in  the  box)  isn't  so  bad. 

A  further  important  caveat  concerns  the  situation 
when  [np]  is  very  small,  for  then  we  are  likely  to  obtain 
very  unreliable  results.  IET  may  then  have  points  of  con¬ 
vergence  at  many  locations  of  no  interest,  just  because  of 
the  granularity  of  the  distribution  at  that  "window"  size 
(where  by  "window"  size  we  just  mean  the  number  of  observa¬ 
tions  inside  the  ellipsoid) .  One  wants  the  ellipsoidal 
window  to  be  big  enough  so  that  when  we  look  through  it, 
we  can  distinguish  the  trend,  or  signal,  from  the  noise. 

Next  we  will  consider  several  examples  so  as  to 
illustrate  a  number  of  the  above  generalities.  First  we 
review  an  example  on  page  195  of  Duda  and  Hart  [6] .  A 
sample,  consisting  of  8  N(-2,l)  and  17  N(2,l)  observations 


it 
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is  to  be  clustered.  We  find  that  k-means  1  converges  from 
a  variety  of  starting  values  to  two  clusters  with  means 
-2.18  and  1.68  (we  are  assuming  that  we  know  that  there 
are  two  clusters) .  This  happens  to  be  a  perfect  clustering 
a  result  made  possible  by  the  fact  that  the  data  from  the 
two  distributions  happen  to  be  nonoverlapping.  We  find 
that  k-means  2  also  converges  to  these  two  clusters  from 
"good"  starting  values  like  -2,2,  but  not  from  "bad" 
starting  values  like  -2,7.  Here,  however,  we  will  not 
focus  on  the  relevance  of  good  starting  values ,  but  in¬ 
stead  on  the  question  of  robustness  against  outliers. 

Suppose  now  that  one  additional  observation,  at  x  =  25 
is  added  to  the  sample.  We  set  the  starting  values  at  -2,2 
and  apply  k-means  1,  obtaining  the  results  in  Table  2.1. 
Obviously,  what  has  happened  is  that  the  outlier  at  x  =  25 
now  has  the  second  cluster  all  to  itself. 

Next  we  apply  k-means  2  to  the  same  data.  Since  we 

~  2 

are  working  in  one  dimension  we  will  write  a ^  for  J j • 
Taking  the  starting  values  to  be  the  population  parameters, 
we  obtain  the  successive  estimates  in  Table  2.2.  There, 
when  the  variance  of  cluster  2  becomes  large,  most  points 
become  "close"  to  it  and  "far"  from  cluster  1,  whose 
variance  gets  ever  smaller.  What  we  are  observing  here 
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is  an  inherent  instability  in  the  k-means  2  algorithm. 

The  point  of  the  preceding  example  is  that  one 
outlier  can  completely  throw  off  our  search  for  large 
clusters.  In  general,  data  with  many  outliers  will  lead 
to  a  k-means  clustering  which  assigns  the  bulk  of  the 
centrally  located  data  to  one  cluster  and  the  rest  to 
however  many  other  clusters  there  are.  Next  we  examine 
a  more  interesting  example  and  compare  the  performance  of 
IET  and  k-means. 

We  generate  a  sample  of  300  N(u1,  I2 ) /  300  N(u2»  I2) » 

300  N(y^,I2),  and  100  N(y^,4Q0I2)  observations  where  y^  =  (0,0)', 
u2  -  (6,  0)',  and  y^  *  (12,  0)',  and  attempt  to  cluster  it. 
When  k-means  1  is  applied  to  these  data  using  starting 
values  (1,  0)',  (5,  0)’,  (14,  0)',  it  converges  to  three 
clusters  whose  means  are  (2.11,  0.30)’,  (5.80,  0.25)', 

(13.13,  0.31)',  which  is  a  relatively  satisfying  result. 
Unfortunately,  if  the  three  starting  values  are  (-3,  0)', 

(2,  0)',  (7,  0)',  k-mean  1  converges  to  three  clusters 
whose  means  are  (-23.48,  3.55),  (2.28,  0..19),  (12.21,  0.21) 
and  the  first  of  the  clusters  contains  only  27  points. 

When  k-means  2  is  applied  to  the  same  data,  using 
the  population  parameters  for  the  starting  values  of  the 
three  clusters,  the  algorithm  yields,  on  the  fourth 
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iteration  , 


these  means 


and  covariances: 


( 

( 

( 


6.87 

4.22 

10"3 
2  x  10 

0.74 

-0.17 


4.22 

6.03 


2  x  10 
8  x  10 


) 


-0.17 

0.96 


where  the  three  clusters  contain  723,  4,  and  273 
observations,  respectively.  The  third  cluster  is  quite 
good  but  the  first  cluster  has  absorbed  practically  all 
of  the  second  one,  as  well  as  almost  all  of  the  outliers. 
On  the  fifth  iteration,  the  rest  of  cluster  2  is  absorbed 
by  cluster  1.  If  the  algorithm  is  allowed  to  continue, 
using  only  two  clusters  now,  cluster  3  will  also  be 
gradually  absorbed  by  cluster  1.  What  happens  is  that 
the  outer  edges  of  cluster  3  are  nibbled  away  bit  by 
bit;  at  each  stage  its  variances  are  thereby  reduced 
making  all  of  its  remaining  points  closer  to  cluster  1. 

Suppose  now  that  we  undertake  a  search  for  these 


three  clusters,  using  IET.  It  would  be  natural  to  take 
the  sample  mean  and  covariance  of  all  of  the  data  as  a 
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starting  value.  Then,  3  iterations  of  IET ,  with  p  set 
equal  to  0.3,  lead  to  the  estimates 


/  5.94  \  /  1.08  -0.07  \ 

\  -0.05  /  ,  l  -0.07  0.92  I 


which  are  excellent.  If  we  take  y  =  (1,  2) '  and 
Z  -  I2  as  starting  values  and  set  p  =  0.3,  then  IET  com¬ 
putes  the  sequence  of  means:  (0.31,  0.15)',  (0.1,  0.04)', 
(0.05,  -0.01)'  and  stops  with  the  final  covariance 
estimate  being 


/  0,9  0.02  \ 

\  0,02  0.94  /  . 


When,  again,  p  =  0.3  and  the  starting  values  are  y  *  (10,  10)' 
Z  =  I2*  the  sequence  of  sample  means  is  (9.92,  1.38)', 

(10.95,  0.34)',  (11.47,  0.2)’,  11.67,  0.12)', 

(11.79,  0.09)',  (11.87,  0.13)',  (11.90,  0.13)', 

(11.92,  0.13)'  and  the  final  covariance  estimate  is 


0.98 

-0.06 


■0.06  \ 

1.10  /  . 


In  examining  the  pattern  of  convergence  in  the  last  two 


-15- 


trials,  one  is  struck  by  the  fact  that  it  is  similar  to 
that  of  geometric  convergence,  although  the  rate  appears 
to  gradually  change  as  the  stopping  point  is  approached. 

In  the  next  chapter  an  asymptotic  analysis  of  IET  will 
reveal  why  this  phenomenon  occurs. 

So  far  p  has  been  set  to  the  "right"  value  (each 
cluster  containing  30%  of  the  entire  sample  with  an 
additional  10%  contamination  by  outliers) .  Therefore,  one 
might  worry  that  our  success  so  far  is  rather  artificial. 

So  set  p=0.2  and  take  y  =  (4 ,  4)  ’  ,  %  =  I2  as  starting 
values.  Then  the  sequence  of  means  is  (4.54,  0.69)', 

(5.32,  0.45)',  (5.82,  0.3)',  (5.89,  0.22)',  (5.90,  0.13)', 
(5.91,  0.10)',  (5.92,  0.10)’,  (5.93,  0.09)',  (5.93,  0.08)'. 
In  general,  setting  p  at  too  small  a  value  will  not  be 
damaging,  and  we  see  in  the  preceding  trial  that  the  rate 
of  convergence  was  not  even  substantially  affected  by  the 
fact  that  p  was  only  2/3  of  the  true  value  for  the 
cluster  being  sought. 

On  the  other  hand,  when  p  is  bigger  than  the 
proportion  of  points  in  the  cluster,  it  is  possible  to 
encounter  difficulties.  For  instance,  if  we  again  work 
with  the  same  data,  set  p  *  0.5  and  use  the  sample  mean 
and  covariance  of  all  the  observations  as  starting  values, 
after  11  iterations  IET  converges  to  a  cluster  with  mean 
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and  covariance: 

(3-00  \  /  9.98  -0.29 

0.09  I  ,  (  -0.20  0.48 

Furthermore,  upon  examining  the  500  points  included  in 
this  final  cluster,  we  learn  that  251  are  from  the 
N  ((0,  0)',  I 2 )  distribution ,  246  are  from  the 
N  i(6,  0)',  1^,)  distribution,  and  the  other  3  are  outliers 
(from  the  N  ((0,  0)',  4001^)  distribution).  This  cluster 
is  described  in  Figure  2.3. 

We  do  not  claim  that  the  remarks  we  have  made  about 
k-means  are  novel  and  indeed  it  may  be  objected  that  some 
k-means  algorithms  currently  in  use  have  already  been 
immunized  against  at  least  some  of  our  criticisms.  For 
instance,  by  allowing  the  introduction  of  a  new  cluster 
when  an  observation  is  "too  far”  from  all  of  the  current 
clusters,  we  can  prevent  outliers  from  taking  hold  of 
large  clusters  and  pulling  them  far  away  from  the  bulk  of 
the  data.  It  may  even  be  possible,  through  the  use  of 
other  ad  hoc  addenda  to  the  k-means  algorithm,  to  eliminate 
the  instability  in  k-means  2.  Maronna  and  Jacovkis  [10] 
compare  several  different  variable  metric  clustering 
methods,  (including  k-means  2  with  one  observation 


reclassified  per  iteration  and  conclude,  after  noting  the 
instability  we  have  pointed  out,  that  the  only  such 
method  they  would  recommend  is  that  same  k-means  2  but 
with  the  covariances  normalized  by  the  requirement  j % j |  =  1. 
This  method  has  the  weakness  that  it  does  not  really  allow 
for  the  possibility  that  different  clusters  may  have  quite 
different  scales.  Another  approach  would  be  to  estimate 
the  covariance  using  only  the  central  portion  of  a  cluster. 
(One  might  then  use  the  methods  of  chapter  5  to  "scale  up" 
those  estimates.)  Our  purpose  in  discussing  k-means  has 
been  to  highlight  the  basic  differences  between  it  and 
IET ,  which  is  a  natural  competitor.  No  doubt  experienced 
users  of  either  algorithm  will  be  able  to  successfully 
cluster  a  sample  with  genuine  ellipsoidal  clusters. 

Of  course  IET  does  not  need  to  be  "patched  up";  its 
simplest,  most  fundamental  form  is  already,  in  a  certain 
sense,  robust.  An  important  dividend  is  that  the  form  of 
IET  that  we  propose  to  use  in  practice  (and  have  illustrated 
in  this  chapter)  is  sufficiently  simple  for  it  to  admit 
fruitful  asymptotic  analysis,  a  task  which  we  undertake 
in  the  next  chapter. 
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Chapter  3  -  Asymptotics 

In  this  chapter  we  study  the  asymptotic  behavior  of 
IET  when  there  is  only  one  cluster  to  be  found  and  there 
are  no  outliers.  The  word  "asymptotic"  is  used  here  in 
two  different  senses:  we  imagine  that  (1)  the  sample  size 
is  large  or  infinite  and/or  (2)  that  certain  parameters 
describing  the  algorithm  approach  limiting  values.  The 
first  part  of  the  chapter  will  be  concerned  with  zero 
order  asymptotics,  which  is  to  say,  how  the  expectations 
of  certain  quantities  behave  (or,  alternatively,  how 
infinite  samples  behave) .  Later  we  shall  obtain  a  first 
order  asymptotic  result:  the  large  sample  distribution  of 
what  we  shall  call  the  stationary  point  of  IET. 

It  will  be  helpful  to  introduce  some  notation  at  this 
point.  We  suppose  that  f(x)  denotes  a  spherically 
symmetric  density  about  x  =  0  in  k  dimensions,  that  is, 
f(x)  =  c^  g(|xj  ),  where  c^  is  the  normalizing  constant 

2  r  2 

and  f  x i  =  l  x* ,  and  F(x)  is  the  corresponding  cumu- 
i=l  1 

lative  distribution  function  (cdf ) .  Furthermore,  T  =  X'X 
has  the  generalized  chi-squared  distribution  with  k  degrees 
of  freedom  with  density  £^,  given  in  (Al.l)  ,  and  cdf  F.^. 

We  shall  also  speak  of  the  ellipsoidal  distributions 
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i/2 

generated  from  f  :  if  X  ^  f,  then  Y  =  y  +  A  X  has  an 
ellipsoidal  distribution  if  A  is  symmetric  and  positive 
definite.  We  assume  that  g  has  the  property  that  the 
necessary  moments  exist.  Hence  E(Y)  =  y,  and  the  co- 
variance  of  Y,  as  shown  in  Appendix  1,  is  X  =  (2ttc^)  ^c]c+2Ai 
Two  examples  we  shall  refer  to  are  the  spherical  normal 


with  density  f(x)  =  cv^g(x'x),  where  c.  =  ( 2tt ) 


k/2 


and 


g(t)  =  exp(-  t/2) ,  and  the  multivariate  normal,  which  is 
the  corresponding  ellipsoidal  distribution.  (In  this 
case  %  =  A) .  Numerous  properties  of  symmetric  and  normal 
densities  are  collected  in  Appendices  1  and  2. 

We  shall  denote  the  sphere  of  radius  a  centered 
at  5  by  S  ( 5)  =  {x:  (x  -  5)  '  (x  -  5)  <  a2}  .  If  we  define 

3. 


(3.1) 


P  (6)  =  /  f  (x)  dx , 


S  (5) 

3 


and 


(3.2) 


e  (6)  =  /  xf(x)dx, 
a  S  ($) 

3 


then  the  conditional  mean  of  X  is 


e  (a) 

(3.3)  ya(6)  =  p  (5)  =  E ( X i X  -  Sa (5 ; 


1 


-20- 


When  the  sample  size  is  large  and  k  =  1,  u  (5)  is  the 

a. 

approximate  result  of  one  iteration  of  IET  when  <$  is  the 
starting  y. 

The  first  problem  we  shall  consider  is  the  following: 
under  what  conditions  will  one  iteration  of  IET  improve 
the  estimate  of  the  mean  of  the  distribution  no  matter 
what  the  starting  value  is,  when  k  =  1  ? 


Theorem  3.1:  In  the  univariate  case,  when  a  >  0  and 

i 6 !  >  0, 

p  —  F* 

(3.4)  |ua(5)}  <  }6)  iff  0  <  — -  <  2f c 


f ( X ) dX . 
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Proof:  Observe  that 


t  t 

/  xf(x)dx  =  tF(t)  -  sF(s)  -  /  F (x) dx 
s  s 


and 


t 

/  f (x)dx  =  F(t)  -  F(s)  . 
s 


Therefore,  putting  s  =  5-a,  t  =  S+a,  we  have 

6+a 

(6+a)F(6+a)  -  (6-a)F(6-a)  -  /  F(x)dx 
ua(6)  =  F (6+a)  -  F(6-a) 


which  implies 


Ua(6) 


If  5  >  0,  then  |u  (5)  I  <  6  iff  0<F  -  F,  <  2  <S  f  . 
Similarly,  if  6  <  0,  then  ju(5) |  <  -5  iff 
0  ^  Fc  -  F^  _>  23fc.  But  these  results,  taken  together, 
are  equivalent  to  (3.4).  £ 

The  proof  of  this  theorem  did  not  rely  on  the  existence 

of  a  density  f:  we  could  have  replaced  f  by  dF  and  f 
_  ^  5  +a 

by  (2a)  /  dF(x).  The  distribution  of  F  need  not  be 

o-a 

symmetric  about  0,  although  we  stated  Theorem  3.1  that  way 
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because  of  our  special  concern  wtih  such  distributions 
in  this  chapter.  Finally,  we  do  not  even  require  that  F 
have  mean  0.  So  the  theorem  can  be  applied  to  a  finite 
sample  (with  empirical  distribution  function  F  ) . 

Next  we  will  investigate  the  behavior  of  u  (5)  as 
<5  0  while  a  stays  fixed  when,  again,  k  =  1.  In 

essence  we  wish  to  answer  the  question  of  how,  in  one 
dimension,  IET  behaves  when  one  iteration  is  performed  and 
we  are  near  the  true  mean. 

Theorem  3.2:  In  the  univariate  case,  if  f  is  continuous 
in  a  neighborhood  of  x  =  a,  then  if  F(a)  >  1/2, 


(3.5)  ua(<5)  =  F(a)£-?[/26  +  °(6)  as  5  -  °- 

Furthermore,  if  f  i_s  n-1  times  differentiable  in  a  neighbor- 
hood  of  a,  then  (0)  exists,  and  vanishes  when  n  is 

even . 


Proof:  Since  e  (0)  =  0,  it  follows  that  u  (0)  =  0.  Observe 

3.  3 

that  e ' ( 0 )  =  af(a)  +af(-a)  =2af(a)  and  Pa(0)  =  2 (F (a)  -  1/2 ) . 
a  a 

Hence , 


1  e  (0 

u  a  ( 0 )  “  P 


(0) 


e  (O)p'  (0) 

3  ^ 

p2  (0) 

3 


af  ( a) 

P (a)  -  1/2 


-23- 


which  implies  (3.5).  In  general,  the  n  derivatives  of 
P  and  e  are 

CL  3. 


(3.6)  Pin>  (5)  =  f(n-1)(6+a)  -  f(n_1)(6-c 


(3.7)  e^n)(6)  =  (n-1)  [f  (n_2)  (<5+a)  -  f  (n-2)  (<5-a)  ] 


+  [(<5+a)f  (n-1)  (5+a)  -  (6-a) f (n_1) (S-a)] 


Since  f  is  symmetric  about  x  =  0,  f^(x)  =  (-1)  nf  ^  (-x) 


Hence,  (3.6)  and  (3.7)  imply  that 


j  0  n  odd 

(3.8)  P(nl  i  P<n,(0)  = 

2f  (a)  n  even 


(3.9) 


e(n)  =  eM(0l  . 


"2 (n-l) £(n~21  (a)+2af  ln"'L)  (a)  n  odd 


n  even . 


(n)  _  )t  (n) 


Let  u  =  u  (0) .  Then,  since  e  =  P  u  , 
a  a  a  a 


(3.10) 


(n)  =  T  (n)P(k)u(n_1<) 


Now  we  use  an  induction  argument  to  prove  u v  =  0 .  Of 
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course  y^  =  0.  Suppose  u(0)  =  y^  =  ...  =  y2(n-l)  =  0 

s  t 

and  consider  the  (2n+l)  instance  of  (3.10).  Then, 


(2n)  _ 


2n 


=  y  (2n)P(k)u{2n-k) 

k=0  k 


Observe  that  all  terms  with  k  odd  vanish  because  of  (3.8) 
Similarly,  for  k  >  0  and  k  even,  the  corresponding 
terms  vanish  by  the  induction  hypothesis.  Hence, 

But 

,  ( 2n)  n  h 

hence  y  =  0 .  ■ 


(2n)  _  (0)  (2n) 

e  =  P  y 


e^2n^  =  0  by  (3.9)  and 


P(0)  >  0; 


If  we  put 


b  (a) 


af  (a) 

F (a)  -  1/2  ' 


then  the  preceding  theorem  asserts  that  y,(6)  =  b(a)S+  °(6), 

a. 

or  that  for  6  small,  b(a)  represents  the  proportional 
reduction  in  bias  achieved  by  performing  one  iteration  of 
IET .  It  is  interesting  that  b(a)  has  a  simple  geometrical 
interpretation  as  the  ratio  of  the  density  f  at  the  boundary 
of  [-a,  a]  to  the  average  density  of  f  in  [-a,  a], 

(3.11)  b  (a)  =  ^ -  . 

^■(2(F(a)  -  1/2)) 

It  turns  out  that  b(a)  is  the  k=l  version  of  b^ta). 
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the  k-dimensional  bias  reducing  function  that  appears  in 
the  next  theorem  (that  is,  b(a)  =  b^(a)).  We  define 

ci<lg(a2) 

(3.12)  b,  (a)  =  - - 

Fk(a  )/Vk(a) 

where 


Vk(a) 


n 


k/2 


F(J 


+  1) 


k 


is  the  volume  of  a  k-dimensional  sphere  of  radius  a,  given 
in  (A1.5).  The  function  bk(a)  is  the  ratio  of  the  density 
of  X  on  the  boundary  of  Sa(0)  to  the  average  density 
of  X  inside  Sa(0).  (Of  course,  Fk(a2)  =  Pr(x'x  <_a2).) 

In  the  k-dimensional  case,  the  approximate  result  of  one 
iteration  of  IET  is  not  quite  given  by  y  (5)  because  it  depends  not 
only  on  6  but  also  on  the  initial  estimate  of  the  covariance 
matrix.  Nevertheless  the  behavior  of  y,(S)  expressed  in 
Theorem  3.3  is  relevant  to  our  understanding  of  IET. 


Theorem  3.3:  If  g  is  differentiable  in  a  neighborhood 
2 

of  a  ,  then 


la  a  ( 5 )  =  bk(a)<5  +  o(5)  as  6  -*■  0. 


(3.13) 
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Proof;.  By  (3.2),  e  (6)  =  /  xc"Lg (x’ x) dx.  We  change 

-  a  S  (6)  K 

a 

variables  by  setting  y  -  x-6  and  obtain 


(3.14)  e  (6)  =  /  (y+6  )  c^g  (y  1  y  +  2y'5  +  6 ' 6 )  dy . 

S  (0)  * 

a 

Expanding  ea(6)  we  find,  to  first  order,  that 

(3.15)  e  (6)  =  [/  c71g(y 'y)dyl 6 

a  S  ( 0 )  * 

a 

+  (2/  yy'c,  1gl  (y'y)dy]  6  +  o(<5) 

sa(0) 

by  using  the  spherical  symmetry  of  g(y'y).  But  the  first 

2 

term  of  (3.15)  is  just  Fk(a  )5  and  the  second  term  is 
2M35  by  (Al.13)  and  spherical  symmetry.  Substituting 
(A1.19)  for  M3,  we  find  that 


ea«S»  -  2(*2)>5- 


Furthermore,  P,(6)  =  P_(0)  +  °(1) ;  hence, 

a  cl 


.  ,  . , ,  , . .  _  ,  1  Ck+2, 2 f k+2 (a  } , 

(3*16)  ""a  4  "  ( 2-rr  c,  )  P  (0)  4‘ 


Using  (Al.l)  and  (Al.5)  it  is  easily  verified  that  (3.16! 
is  equivalent  to  (3.13).  ft 


i 
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Two  useful  alternative  formulas  for  b^(a)  are  given 

in  (Al.3)  and  (A1.4)  and  some  numerical  values  are  presented 

in  Table  A2 . 1  when  X  ^  N(0,I^).  In  that  table  we  write 
2 

p  =  F,  (a  )  for  the  probability  of  lying  in  S  (0);  alterna- 
k  a 

-1  1  /2 

tively,  a  =  (F,  (p) )  '  .  It  is  interesting  to  study 


,-l 


72, 


bk((Fk  (p) )  7  )  for  fixed  p  as  k  increases.  For 
fixed  p,  F~1(p)  increases  with  k;  therefore,  by  Lemma  A2.1, 
b^  decreases  with  k.  For  k  large,  in  fact,  by  (A2.14), 

-i/2 

bk  =  0(k  7  )  as  k  -*•  °°  when  p  is  fixed.  To  summarize, 

for  a  given  p,  in  high  dimensional  space,  IET  will  converge 
faster  than  in  a  low  dimensional  space.  This  fact  is,  at 
first  glance  at  least,  rather  surprising  and  indeed  it  depends 
on  the  sample  size  being  sufficiently  large,  where  what 
constitutes  "large"  may  itself  grow  with  k. 


Later  in  this  chapter  we  shall  prove  a  generalization 
of  the  preceding  theorem  Theorem  3.6,  which  will  imply 
that  b^  is  still  the  bias  reduction  factor  when  one  computes  the 
expectation  of  X,  given  that  it  lies  in  an  off  center  ellipsoid. 


It  is  natural,  now  that  we  have  studied  the  zero 
order  asymptotics  as  <5  -*•  0 ,  to  undertake  the  analogous 
investigation  as  6  -*■  ■».  Two  formulations  of  this  problem 
are  relevant.  We  can,  first  of  all,  let  the  radius  a  be 
fixed  and  send  6  -*■  »,  thereby  forcing  the  probability  in 


the  spheres,  P  (6)  to  go  to  0.  Alternatively,  we 

cl 

can  fix  p  ( 6 )  =  p;  then,  as  6  -*•  °°,  a  must  also 
approach  00 .  Theorem  3.4  deals  with  the  first  alternative 
while  Theorem  3.5  deals  with  the  second.  Neither  result 
is  as  general  as  it  could  be,  but  both  provide  some  in¬ 
sight  into  the  operation  of  IET. 

Theorem  3.4;  Suppose  X  v  N(0,Ik).  Then, 

(3.17)  lim  | y  ( 6 ) -d |  »  0 
5  -*•  °°  a 

where  d  is  the  point  in  S& ( 6 )  closest  to  0. 

In  order  to  prove  this  result  we  must  make  use  of  a 
rather  technical  lemma,  the  proof  of  which  will  follow 
later.  Note  that  d  =  d(S)  and  let 

(3.18)  T  (6)  =  (x  :  x  e  S  (5)  and  jx|  <id(6)|  +  e} 

£  Si 

and 

(3.19)  U  (5)  =  Sa  (5)  -  T_  (5)  . 

t,  ci  c 

We  shall  also  abbreviate  S  (5)  as  S,. 

a  o 

Lemma  3.1:  Suppose  X  ■v  N(0,I^).  Then, 
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P(T  (5)) 

(3.20)  — pjg  }  +  1  as  6  +  00  V  e  >  0 . 

6 

and 

(3.21)  sup  |x-d|  -*•  0  as  £  0  uniformly  in  6  . 

xeT  (6 ) 
e 

Theorem  3.4  asserts  that  the  conditional  expectation 

of  X  v  N(0,I^)  given  that  X  e  approaches  the  point 

of  maximum  density  in  the  sphere,  namely  d,  as  the  S.'s 

o 

move  out  to  The  two  pieces  of  information  in  Lemma  3.1 

are  that  as  <5  -*■  ”,  more  and  more  of  the  probability  in 
is  actually  contained  in  a  small  subset,  T£(6),  of  S^, 
and  that  every  point  in  T£(5)  is  very  close  to  d  when 
e  is  small. 

Proof  of  Theorem  3.4:  We  may  write 

/  xf(x)dx  +  [  xf(x)dx 
TC(S)  U£(5) 

(3.22)  Ua(o)  =  p(T.  (o)  )  +  P(U_  (<$)  )  ’ 

P(T  (6)  P(S.) 

If  we  set  c6  =  P(U_(5))/P(T£(5))  =  (1  -  -p-(g0  )p(T  f6 y y  , 

then  by  Lemma  3.1,  c^  -*■  0  as  6  ■+  ».  It  is  possible  to 
rewrite  (3.22)  as 


1+C6 


•E  ( X  I  X  £  T  ( 6 )  )  + 


1+c 


-E  ( X  I  X  £  U  (  5  )  ) 


(3.23)  ii-(5) 


which  implies  that 


(3.24)  u  (6)  -  E  (X  |  X  £  T  (6)  )  = 

£ 


(E (X | X  e  U£  (5)  ) 


E  (X  |  X  £  Te  (6)  )  )  . 


Since  both  E(X|X  e  U  (6))  and  E(X  X  e  T  (5))  lie  in 
(a  consequence  of  the  convexity  of  S.),  it  follows 
that  they  can  be  no  farther  apart  than  2a.  Hence  (3.24)) 
implies  that 


(3.25)  |u(5) 

a 


E  (X  |  X  £  T£  ( <5  )  ) 


0  as  5  -*•  oo 


Now  E  (X  |  X  £  T  (6))  £  T  (6)  since  T  (6)  is  convex. 

£  £  E 

Fix  n  >  0  and  choose  £  >  0  so  small  that  for  all  5 

sup  | x— d |  <  n/2 ,  which  we  can  do  as  a  result  of  Lemma  3.1. 
xeT.  ( 5 ) 

Then  it  follows  that 


(3.26)  !E(X:X  £  T__(5)  )  -  d|  <  j 

Next  choose  A  such  that  |d|  >  A  implies  that 


(3.27)  ; u a  ( d )  -  E (X , X  £  T_ ( 5 ) )  '  < 
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which  we  can  do  as  a  result  of  (3.25).  Combining  (3.26) 
and  (3.27)  we  obtain  the  result  that  |6|  >  A  implies 
]u  (<5)-d[  <  n t  which  is  what  we  desired  to  prove.  M 

Proof  of  Lemma  3.1:  Without  loss  of  generality  we  may 
assume  that  6  lies  on  the  positive  x^  axis  and  that 
1 5  |  >  2a.  Then  we  may  write  d  =  <5-a,  using  the  convention  that 
5  (or  d)  can  mean  either  the  vector  or  its  length 
(equivalently,  its  component) ,  depending  on  the 
context.  Equation  (3.21)  is  geometrically  obvious,  so 
we  shall  only  prove  (3.20).  Let  V(6,e)  be  the  volume  of 
T£(6)  and  observe  that 

(3.28)  P(Te (5) )  >  P(Te/2(6)  ) 

>  (2 tt)“  /2exp  (-  j(d  +  e/2)2)V(<5,|) 

and 

(3.29)  P ( U  _  (  5 ) )  <  (2t)“  7  2  exp  (-  | (d+c ) 2 ) ( a ) 

where  V^(a),  the  volume  of  S^,  is  given  in  (Al.5). 

It  is  easy  to  show  that  a  sphere  with  radius  c/4  centered 
at  d  + c/2  is  contained  in  T_(o).  It  then  follows  that 
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(3.30)  V(6,e)  >  Vk(e/4)  =  0(ek) 


Using  (3.28),  (3.29),  and  (3.30)  we  may  now  prove  (3.20): 

P(U£(S))  exp  [-  j(d + e) 2lVk(a) 

P(Tz(5})  "  exp  [-  |(d  +  c/2)2}  v(fi,e/2) 

1  3  2  ^ ^ 

<  exp  [-  j(de  +  Te  -  0  as  6  -  » 

since  d  -*■  00  as  6  -*■  °°.  ® 

The  next  theorem  shows  what  happens  in  the  one 
dimensional  case  when  the  S  (<5)'s  go  out  to  infinity 

a 

in  such  a  way  that  the  probability  they  contain  goes  to 
a  limit  which  is  non-zero.  We  shall  write  0 (x)  and  0 (x) 
for  the  N ( 0 , 1 )  density  and  cdf,  respectively. 


Theorem  3.5:  Suppose  X  v  n(0,1).  Then  if  Pa(5)  -  p 
as  5  -  *>  for  some  p  such  that  0  <  p  <  1 ,  it  follows 
that 


(3.31) 


* 


i>(5  "  i  i-c) ; 


Proof :  Since 
which  implies 
Therefore,  as 


f  -  *>,  J>(f+a)  -  1.  Hence,  f>  ( 5 — a )  -  1-p, 
that  i-a  -  ?  J'(l-p).  Of  course,  $  +  a  -  30 . 

5  -  » 
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x<p  (x)  dx 


(3.32)  Ua(6)  =  p-fjj 

Si 


®a(5)  ^  ^~1(I-P) 


But  since  /  x$(x)dx  =  *  ( c ) ,  we  conclude  that 


u  (5)  -  p"1ct)($“1(l-p)  )  .  * 

CL 


We  may  approximate  the  limiting  conditional  expectation 
of  X  in  (3.31)  in  a  simple  fashion.  In  Woodroofe  [13,  p.  97] 
it  is  shown  that  if  x  >  0, 


(3.33)  1  -  *(x)  <  I  d  +  x”2)  (1  "  *<x)) . 

By  setting  x  *  $  ^(1-p)  and  applying  (3.33)  it  is  easy 
to  derive  that  if  p  <  1/2, 

(3.34)  0  1  P  (^hliELt  .  s-l(1_p)  <  *  -  • 

P  $  (1-p) 

Hence,  as  p  —  0,  p  d  ^"(1-p))  ~  $  (l-p)  0-  Some 

numerical  results  are  exhibited  in  Table  3.1.  Observe 
that  when  o  —  0,3 ,  one  iteration  of  X.&T  using  the  worst 
possible  starting  value  will  still  lead  to  an  estimate 
with  expectation  0.35.  Of  course  this  comment  is  based 
on  the  assumption  that  the  observations  are  from  a  uni¬ 
variate  normal  distribution. 
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The  rest  of  this  chapter  is  devoted  to  an  analysis 
of  the  first  order  asymptotics  of  IET ;  more  particularly, 
we  shall  be  interested  in  obtaining  the  asymptof'c  distribu¬ 
tion  of  what  we  shall  call  the  stationary  point  of  this  algor¬ 
ithm.  Let  p  =  (b,v,B)  be  the  parameter  specifying  the  ellip¬ 
soid. 


(3.35) 


E  ( 4> )  =  (y  :  (y-v)  '  B-1  (y-v)  <b2} 


where  b  is  a  scalar,  v  is  a  k-vector,  and  B  is  a  symmetric 

positive  definite  k  x  k  matrix  with  tr B  =  k.  We  can  represent 

the  operation  of  IET  as  a  sequence  with  = 

and  =  T  ( <£  ^  )  where  the  operator  T  is  a  function 

n  n 


of  the  sample  (of  size  n)  and  is  defined  by 


(3.36) 


~  (m+1) 


:  3 . 37) 


(m+1) 


-p— T  /  y  dH  , 

[npj  1  *  n 

E 

m 

f  .  ~ (m+1)  .  ,  ~ (m+1)  ,  ' 

/  (y-v  )  (y-v  )  dH 

*  n 
E 

m 

k-’-tr  /  (v-.y^11)  (v-'3lm+1))dH 

__  * 

^  m 


n 


and  b  ^nH’^  is  the  smallest  value  of  b  for  which 


(3.38) 


LHEJ  =  r 


dH 


"m+1 


1 


■"gg«! 
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1/2 

where  Hn  is  the  empirical  cdf  of  Y  =  u  +  A  X,  which 

has  an  ellipsoidal  distribution  with  mean  yj  and 

covariance  %  =  (2ttc^)  We  define  the  "true  <J>"  to 

be  -t>0  =  (b0,v0,B0)  where  \jq  =  u,  Bq  =  k//(tr  %)  ,  and 

* 

bg  is  determined  implicitly  by  Pr  (Y  e  E  (<£g))  =  P- 


Definition  3.1:  A  local  stationary  point  is  a  sequence 

of  random  points  $  =  (b  ,\>  ,  B  )  such  that 

- e -  n  n  n  n  - 

-1/2 

$n  -  4>q  =  °p(n  )  and  satisfying 


(3.39) 


n 


"  W 


-1/2 
+  o  (n  /4i) 
P 


We  shall  say  that  there  is  an  essentially  unique  local 

-V  ^  I 

stationary  point  if  whenever  and  <t>n  are  local 

~  ~  '  -1/2 

stationary  points,  -  4>n  =  op(n  )• 

Because  of  the  invariance  of  IET  it  will  only  be 
necessary  to  consider  the  special  case  where  the  distri¬ 
bution  is  spherically  symmetric  (i.e.  u  =  0  and  A  =  I,). 
It  will  be  convenient  to  have  special  notation  available 
for  this  situation.  Let  9  =  be  the  parameter 

specifying  the  ellipsoid 

=  x  :  (x-i)  '  (I^+e)  1(x-6)  _<  (aQ  +  \)2} 


(3.40)  E  ( 9 ) 


HHS5! 
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where  X  is  a  scalar,  6  is  a  k-vector,  and  e  is  a 

symmetric  k  x  k  matrix  with  tr  e  =  0.  Here  the  "true  9" 

2  -1 

is  just  0Q  =  (0,0,0)  and  aQ  =  Fk  (p) .  In  this  case, 
the  random  variable  in  question  is  X  and  f,  F,  and  Fn 
are  respectively  its  density,  cdf,  and  empirical  cdf. 

Our  next  theorem,  a  generalization  of  Theorem  3.3, 
will  play  a  crucial  role  in  the  derivation  of  the 
asymptotic  distribution  of  what  will  turn  out  to  be  the 
essentially  unique  local  stationary  point.  We  define 


(3.41) 


2  2 

01  =  X  + 


k  0  k  k 

I  «•  ♦  l  l 

i=l  i=l  j=l 


'13 


Theorem  3.6:  If  g  is  differentiable  in  a  neighborhood  of 

2  , 
aQ,  then  as  |9|  -*■  0, 


(3.42)  /  f  (x)dx  =  F.  (a^)  +  2anf,  (a^)X  +  o  (|e|  ) 

E(9)  *  °  °  k  ° 

®  v  j.  2  9  i 

(3.43)  /  x f(x)dx  =  [j—)  (2£k+2(aQ))5  +  °(^:) 

C  (  v  )  K 


and 
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(3.44)  /  xx ' f (x)dx  = 

E(0) 


+  (777T^)  (2fk+4(a0))£  +  °  M  9  I  )  • 

(  2tt  )  ck 

We  shall  find  it  useful  to  derive  a  lemma  before  embarking 
upon  the  proof  of  this  result.  Set  a  =  a^  +  X  and 

3c 

suppose  that  h  is  a  vector  valued  function  defined  on  R  . 
We  shall  write  Dh(x)  for  the  derivative  matrix  of  h  . 

Lemma  3.2:  If  h  is  differentiable,  as  |  (5,e)  |  -*•  0, 

(3.45)  /  h(x)  f  (x)dx  =  A  +  B  +  C  +  o  ( |  (.<$,£)  |  ) 

E(0) 

where 

(3.46)  A  =  /  h  (x)  c^g  (x ' x)  dx 

x'x<a2 

(3.47)  B  =  /  [Dh(x)](ex/2  +  «5 )  c^g  (x '  x)  dx 


<’3.43)  C  =  /  h(x)(x'-;x  +  2x '  o )  c'^g  '  (x  '  x)  dx . 

,  2  * 
x  x  <  a 


,k+2. _  ,  2 . _ 

(  2-ttg^ 5  Fk+2  (a0}  Xk 


‘2fir"2a0£k+2(^)Ik)X 

k 


■k 
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Proof  of  Lemma  3.2:  We  change  variables  by  setting 

i  1 

2  -  /2 

y  =  (1^  +  e)  (x-5),  where  (1^  +  e)  /  is  chosen 

to  be  symmetric.  Then,  the  range  of  integration  in  (3.45) 

2 

is  the  sphere  {y  :  y’y  <_  a  } .  Since  tr  (e)  =  0,  the 
Jacobian  determinant  is 

(3.49)  l|p|  -  1  +  o(jei) 

Furthermore , 

(3.50)  x  =  (Ik  +  e/2) y  +  6  +  o(|ej) 

and 


(3.51)  x'x  *  y'y  +  y '  ey  +  2y '  <$  +  o  (.  |  (.5  ,  e )  f )  , 


Now,  by  recalling  that  f  (x)  =  c^gfx’x);  substituting 
(3.49),  (3.50),  and  (3.51)  in  the  left  hand  side  of  (3.45); 
expanding  h  and  g  about  y  and  y'y,  respectively; 
and  keeping  only  those  terms  which  are  zero  or  first  order 
in  (5,c),  we  obtain  the  lemma.  I 


Proof  of  Theorem  3.6: 
we  simply  apply  Lemma 
and  .  ,  r 


To  derive  (3.42),  (3.43)  and  (3.44) 

3.2  with  h(x)  set  equal  to  1,  x. 
The  final  step  is  to  expand  the 


espectively . 


results  thereby  obtained  (as  functions  of  a)  around  a^. 

In  all  cases  we  shall  make  heavy  use  of  spherical  symmetry 

and  Lemma  Al.l.  For  convenience  we  shall  write  S  for 

2 

the  region  of  integration  lx  :  x'x  <_  a  }.  When  h  =  1, 

2 

we  find  that  A  =  F^(a  )  and  since  Dh  =  0,  B  =  0. 

Observe  that  the  formula  for  C  may  be  simplified  to 

C=  (/  x^g'  (x'x))  (Seii)  , 

S 

which  implies  that  C  =  0  since  tr  (e)  =  0. 

Equation  (3.42)  then  follows  from 

/  f  (x)  dx  =  Fv(  (a-.+A)  2)  +  o(I (5,e) |) . 

E  (9 )  0 

When  h(x)  =  x,  it  is  obvious  that  A  =  0.  Since  Dh(x)  =  1^, 

2 

we  obtain  immediately  that  B  =  F^(a  )6.  The  equation  for  C 
is 

C  =  /(xx'ex  +  2xx'  <S)c.  1g'  (x'x)dx, 

S 

but,  by  spherical  symmetry,  the  integral  of  xx '  sxc^g '  (x '  x) 
vanishes  and  the  integral  of  the  remaining  term  reduces 
to 

C  =  [ /  x^c'^g' (x'x)dx] (2o) . 

S  1  * 


(3.52) 
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As  a  result  of  (Al.13)  and  (A1.19),  we  may  write  (3.52)  as 


c  =l<55^)l2£k«<a2))'  F*,a2>]5- 


Therefore,  adding  A,  B,  and  C  we  find 


/  xf  (x)dx  =  (5^)  (2f.  -(a *))&  +  o(l  (5, el  |) 
E  (9)  ^  ck  K 


from  which  equation  (3.43)  may  be  derived  by  expanding 

around  aQ.  Finally  we  shall  obtain  (3.44)  by  considering 

two  cases:  h(x)  =  x.x.  for  i  <  j  and  for  i  =  j. 

1  3 

First,  when  i  <  j,  it  is  easy  to  see  that  A  =  0.  We 
find  that  when  we  drop  terms  that  vanish  because  of 
spherical  symmetry  and  the  form  of  Dh  (Dh  has  x^  as 
its  ith  coordinate  and  x^  as  its  jth  coordinate,  with 
the  other  coordinates  equal  to  0) , 


(3.53)  B  =  / (1/2)  (x^e .  .  +  x^e . . ) c.  1g (x ' x) dx  . 
2  J  1 J  1  J  -1  * 


By  the  symmetry  of  j;  and  (Al.8)  and  (Al.17),  equation  (3.53) 
leads  to 


(3.d4)  3  -  (2~ck)  c]<+2Fk+2(a  )_:ij 


Again  keeping  only  nonvanishing  terms,  we  find 
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C  =  /  xjxj(eij  +  e ji)ckig' (x'x)dx, 


I (v ' 


which  simplifies  to 


(3.55)  c  =  2[(  ;;fr-)fk+4(a2) 
(  2  TT  )  Ck 


Ck+2  Fk+2(a  jj 
Zttc^  2  J  ij 


upon  the  application  of  (Al.16)  and  (Al.20).  But  then, 
using  (3.54)  and  (3.55)  we  derive 


(3.56)  /  x.x  •  f  (x)dx  =  ( - 5iI^“^2fk+4  (a2)  }  £ij  for  1  <  j 

r  /  a  \  *  J  f  0  it  \  r' 


E(9) 


(  2 TT  )  C. 


Taking  the  next  case,  i  *  j ,  or  equivalently  h(x)  xj/ 
we  find 


.  _  ,  k+2  ,2 

A  "  2TTCk)  Fk+2  (  } 


and 


by  making  use  of  (Al.3)  and  (Al.17).  Dropping  terms  that 
vanish  we  may  write 


C  =  f  x2(  T  x?e,  ,)clV  (x'x)dx. 
S  1  1=1  1  U  x 


Then,  using  M4 ,  defined  in  (Al.20),  and  equations  (A1.15) 
and  ( Al . 1 6 )  , 


C  =  (3M4leu  +  M4(  l  Hl) 

Ml 


But  since  tr  e  =  0,  it  follows  that  Y  =  -  e..;  hence 

L-  ■  £  £  1  1 


Mi 


li 


Ck+4  2  Ck-H2  ^k+2  ^ 

c  =  tl^r)fk+4<a  1 ' 2 — i«ii- 


Therefore,  summing  A,  B,  and  C  we  find 


(3.57)  ^9)xjf(x)dx-  (!^,Fk+2(a2)  +  (-!i±f-)(2fk+4(a-;))e 


Combining  (3.56)  and (3. 57)  we  obtain 


c  c 

/  xx'f(x)dx  =  (-|±2)F  (a2)Ik+(-^|— )(2f  (a2))e 

E(3)  2^ck  k+2  *  (2tt)  c,  k+4 

from  which  (3.44)  follows  upon  expanding  around  a^.  1 

It  is  easy  to  see  that  as  a  result  of  (3.42)  and 
(3.43)  and  the  formula  for  b^(a)  given  in  (Al.4), 


(3.53)  E  (X '  X  £  E(3))  =  bk (aQ) 5  +o(!e;). 


Using  (3.42)  and  (3.44),  we  obtain  in  a  similar  way  that 
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c  F  (a  ) 

E(  (X-6)  (X-6)  '  |X  £  E(B>)  =  (-jffi  +  O  (  |  9  |  ) 

k  Fk(a0} 

which  we  may  rewrite  as 

(3.59)  E  t  (X-6 )  (X-5 )  *  |  X  £  E  ( 8 )  ]  =  c(k,p)-j^Ik  +  0(  I  9  |  ) 

k 

-1  Ck+2 

where  c(kfp)  =  Fk+2  (Fk  (P))/P-  (Note  that  'fire  Tk  is 

ck 

the  covariance  matrix  of  X.)  It  is  a  simple  matter  to 
extend  (3.58)  and  (3.59)  to  their  ellipsoidal  generalizations: 

(3.60)  E  [  Y  |  Y  £  E*('M]  =  U  +  bk(a0)(v-y)  +  o(|$-<fr0|) 
and 

(3.61)  E[(Y-v)  (Y-v)  ' |Y  £  E*(<j>)]  =  c(k,p  )t  +  0(  |  <p  -  <f>Q  |  )  . 

At  the  end  of  Appendix  1  it  is  shown  that  c(k,p)Z  is  the 
covariance  of  the  truncated  ellipsoidal  distribution,  so 

(3.61)  reassures  us  that  when  u  is  near  ?q,  the  expected 
value  of  l  generated  by  IET  will  be  close  to  the  true 
truncated  covariance.  According  to  (3.60),  the  bias 
reduction  factor  plays  the  same  role  in  the  ellipsoidal 
case  that  it  played  in  Theorems  3.2  and  3.3,  which  is 
another  reassuring  result.  It  is  straightforward  to  write 
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down  the  first  order  terms  in  the  expansion  of  the  left 
hand  side  of  (3.61).  There  is  a  term  in  b-bQ  and  a  term 
in  B-BQ;  the  corresponding  coefficients  play  the  same 
sort  of  role  that  b^(ag)  does  in  the  location  case  in 
telling  us  how  one  iteration  of  IET  will  reduce  the  bias 
in  the  covariance  estimate,  on  the  average. 

Our  next  result  concerns  the  uniformity  of  convergence 

of  linear  functionals  of  the  empirical  distribution  function 

Fn  on  all  ellipsoids  E(9)  such  that  9  is  sufficiently 

close  to  0.  Note  that  | B |  ,  for  a  matrix  B  =  (b^),  will 
2  1/2 

denote  (1  Z  b  .)  7  .  Recall  that  6  =  (X,6,s)  where 
i  j  ^ 

tr  (e)  =  o. 

Lemma  3.3  -  (Uniformity):  Suppose  F  is  a  cdf  on  R 
with  a  bounded  density  and  Fn  is  the  corresponding 
empirical  cdf .  Let 

(3.62)  Vn  =  {9  :  !e;  <  Kn_1/2 + b} 
ar.d 

(3.63)  Dn(^  =  /  *<*>  (dF  ~dF)  -  /  h(x) (dF-dF). 

E ( 0 )  n 

Then  there  exists  a  b  -•  0  such  that  for  any  K  .  0  and  any 
bounded  scalar,  vector,  or  matrix  valued  h  , 
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]  3  -  64)  sup  | D  ( 9 ) 

9  £  V  n 
n 


Proof:  It  will  suffice  to  show  (3.64)  for  scalar  valued  h 

First  we  cover  vn  with  balls  of  radius  p  =  n  ^  a, 


m 

so  that  V  c  {-J  S  (9.)/  where  S  (9.)  =  (9  :  |  0—8 .  j  <  p} 
n  .  ,  p  i  ox  x 


i=l 


We  will  need  m  balls  where 


-  /2  +  ta  *  * 

_  -  „  ,n  °xk  _  T,  _k  (a+b) 

m  ~  Ki (  Y7s - }  '  V 


n 


-/  2  -  a 


and  k  is  the  dimension  of  the  9  space,  that  is 


k  -  1  +  k  +  k(k+l)/2  -  1  =  k(k+3)/2 


If  0  £  S  (9 . ) ,  then 
P  i 


E(9)AE(0i)  C  R ( 0i , p ) 


where  R(9^,o)  is  a  spherical  shell  in  x-space  with  thick¬ 
ness  bounded  by  Ko0.  Hence,  if  9  £  S,(9.), 


D  ( 9 )  -  D  ( 5 • )  i  <  /  lh(x)  |dFn  +  j 

n  n  i  —  i, ,  »  n 


h (x) |dF. 


R(9i,4 


r  ( 0 .  o; 

1 1 
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But  /  |h(x)  IdF  <_  K,  /  dF  <_  K.p  and 

R(e.  p)  r(9.  p) 


h(x)  dF  <  K,  J  dF  <  K.p  +  Kj.  ( p/n)  1/2Z 

R(0,  P)  n  "  3  R ( 3  •  P)  n  "  4  5 

1  /  X  , 


! 


where  Z  is  a  random  variable  with  mean  0  and  variance  1. 
Hence , 


sup  | D  ( 3 )  -  D  ( 3  .  ) |  <  K,(p  +  (p/n)1/2| Z| ) . 
3eSp(0i)  n  1 

—  I  2 

Since  Pr  (|z|  >_  n  )  £  n  ,  by  Chebyshev's  inequality  we 


may  conclude  that 


(3.65)  Pr  (  sup  |  D  ( 3 )  -  D_  ( 9  •  )  !  >  K,  (  p  +  ( p/n )  1/2r\~l )  )  <  n2 
9eSp  ( 9 i )  n  n  1  b 

Since  Dn(3^)  has  expectation  0, 

Var  D  (9  . )  <  n  ^  /  h2  (xj  dF  <_  K_n  ^  j  3  .  j  .  Hence , 

E(9t)AE(0)  1 

Var  £  Ksn~ln~1/2  + b  =  Kgn'  /2  + b  for 

i  =  l....,m.  But  then,  by  Chebychev's  inequality, 


(3.66)  Pr  ( 1 Dn(ii)  ;  >  Kn”  /2  “  a)  £  Kgn"  /2  +  2a  +  b 


where  a  >  0.  We  are  interested  in  showing  that  for  some 


a  > 


0, 
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(3.67)  Pr  (sup  |Dn(9)|  >  Kn“  /2  "  a)  =o(l). 


9eV 


Now  the  left  hand  side  of  (3.67)  may  be  bounded  by 


®  _l/9_, 

l  Pr  (  sup  | D  (0)[  >  Kn  '  ).  In  addition, 


i=l  9eS  (9  . 

p  i 


(3.68)  Pr  (  sup  | D  (9)  j  >  Kn 

9  e  S  ( 9  .  ) 

P  1 


>  Kn"  /2  "  * 


}  1  Pli  +  P2i 


where  P^^  is  the  left  hand  side  of  (3.66)  and 


(3.69)  P  .  =  Pr  (  sup  | D  ( 9 )  -  D_ ( 9 . ) |  >  Kn"  /2 *  a) 

21  9eS  (9 . )  n  1 

0  i 


But  if  we  choose  n  such  that 


V5 


♦  <  Krr1/2^ 


i.e.  if  we  take  n  =  Kinn^a//2^  (1/4)^  then  equation  (3.65) 


'10 


implies 


(3.70) 


d  <  K  n 

“ 2 i  -  11 


a- (1/2) 


But  now,  combining  (3.65),  (3.56),  and  (3.70)  we  obtain 


m 

(3.71)  "  (P,. 

i=l 


P  \  ^  v  _k  (a+b)  ,  - ( 1/2 ) +b+2a  a- ( 1/2 ) 

—  ^12n  ^ 


ii 


If  we  take  a  =  b 


*  -1 

=  (4(k  +2))  ,  then  the  right  hand  side 

of  (3.71)  goes  to  0  as  n  -*■  which  implies  (3.67)  , 
which  is  equivalent  to  (3.64),  the  desired  result.  B 

It  is  interesting  to  note  that  the  a  and  b  we  needed 
in  the  preceding  proof  were  quite  small,  that  is 


a  =  b  = 


2 (k(k+3) 


We  are  now  ready  to  derive  the  asymptotic  distribution 
:ne  1< 
density. 


of  the  local  stationary  point  d>  when  Y  has  an  ellipsoidal 


Theorem  3.7:  There  is  an  essentially  unique  local  stationary 


point  o  =  (b  ,  v  ,  B  )  .  As  n  -*•  *>,  n 
n  n  n  n  — 

asymptotically  normal.  In  particular, 


1/2,1 


*n"V  — 


(3.72)  L(n1/2(v  -u))  *N(0,  C  (1C'P)  2Z) 

p(1-Dk(a0)) 

P roo f :  It  is  enough  to  prove  the  theorem  in  the  spherically 
symmetric  case  where  the  true  density  is  f (x) .  The  invariance 
cf  IET  leads  to  the  more  general  result.  Mote  that  in  the 
spherically  symmetric  situation  we  shall  study  in  this  proof. 


=  (b  ,  v  ,  B  )  is  sinraly  related  to 
n  n  n 


■  _  ~  ^  , 
n  n  n 


n 


,  :-2 

by  on  =  (a9 


:n)2'  'n 


-  ★  v.  ^ 

3n  =  I;<  +  -:n,  when  we  require  £  (;  )  =  E(l^). 


and 
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Then,  as  a  result  of  (3.39)  and  the  uniformity  lemma, 

( Lemma  3.3), 

(3.73)  5  =  p"1  /  xdF  +  p"1  /  XdF  -  p"1  /  xdF  .  o  (n“  /2). 

E(0)  E,8n,  B  (01  P 

If  we  write  1(E)  for  the  indicator  function  of  the  event  E, 
and  let 


(3.74)  W,  =  n_1EX.l(X,  e  E(0))  =  /  xdF 

1  E  ( 0 )  n 


then  upon  making  use  of  equations  (3.43)  of  Theorem  3.6 
and  (A1.4)  we  find  that  (3.73)  may  be  rewritten  as 


w. 


(3.75) 


o  = 


P(1  - 


bk(a0)) 


°p(n-1/2) 


Now,  equation  (3.75)  gives  an  explicit  formula  for  6^. 

We  may  apply  the  central  limit  theorem  to  W^  and  obtain, 
as  a  consequence,  the  asymptotic  distribution  of  3^. 
Because  of  (A1.8)  and  (A1.17) 


(3.76)  unl/\)  -SCO,  <!§§?> rk+2uj>v. 

k 

Using  the  definition  of  c(k,p)  in  (A1.23),  and  the 

1/2 

transformation  Y  =u  +  A  X,  (3.76)  immediately  leads 
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to  (3.72).  In  a  similar  fashion  we  may  obtain  explicit 
formulas  for  X^  and  s  ,  analogous  to  (3.75),  by  making 
use  of  uniformity  and  (3.42)  and  (3.44)  together  with 
(3.39).  So  if  we  let  W3  =  n  ^E1(X^  e  E( 0) )  ,  then 


(3.77) 


X  = 
n 


p  -  W 


2a0fk(a$> 


§7  +  Vn"1/2>- 


Since  L  (W_  “  p ))  -*■  N(0,  p(l-p)),  it  follows  that 


(.(n1/2X  )  -  N  ( 0  ,  -  P(I“P)2 — j) 

<2a0fk(a0}) 


We  can  also  obtain  an  explicit  formula  for  as  a  function 

gf  W3  =  n~1SXiX^l(Xi  e  E(0))  and  Xn: 


(3.78) 


*k  +  £n  =  r^i 


-  +  op  (n  '‘) 

k  tr  (D)  p 


where 


D  *  a3  *  l2TC:<>'lck*2(2aofi<(ao!)I:<l 


*  <<2l)2=k)'1  =  k.4(2£k+4U?)>=n- 


3y  computing  tr(D),  which  would  depend  on  \n  and 

z  ,  we  could  use  (3.73)  to  derive  an  explicit 


but  not  on 
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l/2~ 

formula  for  £  .  Again,  n  en  will  have  an  asymptotically 

normal  distribution  but  with  a  very  messy  formula  for  its 

covariance.  (In  Appendix  1,  Lemma  Al.l  provides  the 

formulas  which  would  be  needed  to  actually  compute  that 

covariance  matrix.)  From  what  we  have  shown  so  far,  it 

follows  that  9  itself,  and  hence  $  ,  is  asymptotically 
n  n 

normal.  We  have  seen  that  if  <j>n  is  a  local  stationary 
point,  then  §n  must  satisfy  (3.75),  (3.77),  and  (3.78). 

But  since  those  three  equations  themselves  yield  an  explicit 
formula  for  9^  such  that  the  corresponding  <f>n  satisfies 
(3.39),  we  have  proved  the  theorem.  I 

We  conjecture  that  if  IET  is  allowed  to  iterate  until 
it  reaches  a  stopping  point,  that  is,  until  for  some  mQ , 

„  (mQ)  ~(riV 

i  =  T  (cj>  )  ,  then  first ,  there  will  be  such  an  mQ 

(with  probability  approaching  one)  and  second,  this  stopping  point 

-  /2 

will  be  in  a  °^(n  )  neighborhood  of  the  local  stationary 

point  whose  asymptotic  distribution  we  have  just  derived. 

If  our  conjecture  is  true,  then  we  have  just  obtained  the 

asymptotic  distribution  of  the  stopping  point  of  IZT.  Were 

we  to  prove  that  IET  does  halt  with  probability  approaching  I ,  it  would 

then  be  enough  to  show  that  there  can  be  no  stationary 

coint  d  such  that  i>  -  sn  is  biager  in  order  of 

magnitude  than  o  (n 

tr 


) ,  since  a  stopping  point  is 


certainly  a  stationary  point. 


It  is  interesting  to  note  that  if  Y  ^  N (u , / ) 
the  result  in  Theorem  3.7  becomes 


‘■(n1/2('Vu,)  -  p-(l-bk(a0) 


■7) 


since  c(k,p)  =  1  -  b^(a0)  for  this  distribution. 


then 


Chapter  4 


Monte  Carlo  Analysis 

In  this  chapter  we  will  be  concerned  with  the  question 
of  how  the  presence  of  outliers  and/or  a  small  (infrequent) 
subpopulation  influences  the  performance  of  IET  when  it 
is  used  to  estimate  the  mean  of  the  larger  (more  common) 
population  present  in  the  sample.  More  specifically, 
we  shall,  in  each  case  considered  here,  generate  a  random 
sample  composed  of  three  kinds  of  data: 

n.^  N(u1,  X1)  observations 

n2  N(u2'  )  observations 

and 

2 

n^  N  ( y ^ ,  0  X )  observations 

2 

where  n2,  n^  <<  n^  and  a  >>  1.  Then  we  shall  use  IET, 
as  defined  in  chapter  1  to  estimate  u ^ . 

Given  this  setup,  how  shall  we  study  the  performance 
of  IET?  There  are  a  considerable  number  of  parameters 
which  must  be  specified,  some  involving  the  algorithm 
and  the  others  involving  the  simulated  data,  before  the 
above  process  is  well  defined.  To  specify  IET,  one  must 
select  starting  values  for  the  mean  and  covariance  as  well 


as  a  sequence  of  proportions  or  p's,  while,  on  the  other 

hand,  before  generating  the  data  one  must  choose 

2 

y 2 /  X-^i  %2'  a  '  ni'  n2'  n3'  anc^'  implicitly,  the 

dimensionality  k,  of  the  problem.  Presumably  our  study 
should  consist  of  setting  these  parameters  at  a  sufficiently 
large  number  of  values  to  encompass  the  range  of  possibili¬ 
ties  likely  to  be  encountered  in  practice.  Unfortunately, 
there  are  too  many  parameters  for  such  a  comprehensive 
investigation  to  be  feasible.  Therefore  we  shall  content 
ourselves  with  the  examination  of  a  relatively  small  number 
of  cases,  in  the  hope  that  we  shall  still  get  a  feeling 
for  the  important  characteristics  of  the  behavior  of  IET. 

To  begin  our  study  we  now  assume,  without  loss  of 

generality,  that  =  0  and  that  ^  always  lies  on  the 

positive  x-^  axis.  Furthermore  we  set  =  1^  and 

always  take  the  starting  values  for  IET  to  be  the  observed 

mean  and  covariance  of  the  entire  sample,  which  consists  of 

n  =  n^  +  n2  +  n^  observations.  Then,  to  proceed  we  need 

only  choose  a  single  positive  number  for  j ^  >  the  dimen- 
2 

sionality  k,  3  ,  n^,  n^,  n^  and  the  sequence  of  p's. 

First  we  consider  an  example.  Set  k  =  3,  =  5, 

n^  =  100,  n^  =  20,  n,  =  0  and  let  the  sequence  of 
proportions  be  0.5,  0.6,  0.7.  The  overall  mean  of  the 
random  sample  generated  according  to  these  specifications 


-55- 


is  (0.88,  0.05,  -0.06)'.  The  three  successive  estimates 
of  y-j,  are 

(0.06,  0.06,  -0.16)  ' 

(-0.07,  0.05,  -0.08)  ' 

(-0.06,  0.06,  -0.07)  '  . 

Incidentally,  the  final  correlation  matrix  is 
f  1  -0.05 


Though  the  general  behavior  of  IET  in  this  example  is 
typical  of  its  behavior  in  many  other  examples  with  similar 
parameter  settings,  it  would  be  nice,  nevertheless,  to  have 
some  quantitative  information  about  its  performance.  The 
rest  of  this  chapter  is  devoted  to  the  presentation  and 
interpretation  of  this  kind  of  information. 

Our  procedure  is  as  follows.  Generate  a  random 
sample  according  to  the  appropriate  parameters.  Apply 
IET,  using  the  sequence  of  proportions  p  , ...jp^  to 

obtain  the  sequence  of  estimates  u^,  u  ^  ,  ...  ,  u  ^  , 

~  ( 0 )  — 

where  -  is  the  overall  mean,  x.  Compute  and  save 


0.04  \ 
0.09  ! 

i  i 


I 


EH 


■ 
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ef^  =  k  /2|u(l) ( /  where  the  superscript  1  indicates  that 
this  is  the  first  Monte  Carlo  trial.  Repeat  this  process 
for  each  of  s  samples  with  the  same  parameter  settings. 
Report  the  means  and  standard  errors  of  the  e|^  's  in 
the  form 


m 


(a  )  (a  )  ...  (a  ) 


m 


Of  course, 


e.  *  s"1  f  e  f 

1  j=l  1 


(j) 


and 


a 2  =  s-i;2  -“I.-  --1 

—  e 

ei  1  3 


=  s  x (s-1)  ±  l  (e f ^  -  e  . )  2  . 
i  i=l  1  1 


Now  suppose  a  sample  has  no  contamination, 
i.e.  n2  =  =  0.  Then  x  v  N(0,n  .  Hence 


,  X  i  ''  n 


,  .  .  Now  let 


(30 


:  4 .  i) 


mk  =  ( 2/k) 


1  /2  :((k^l)/2! 


r (k/2) 


Then  a  simple  calculation  reveals  that 
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/2  2 
E(x(k))  =  k  mk  and  Var(x(k))  =  Ml-n^).  It  then 

follows  that,  in  this  case, 


(4.2) 


E(e0) 


m.  n 
k 


and 


(4.3) 


Var ( e  Q )  =  (l-mk)n-1 


since  e,,  =  k 


=  i  y  I 


(nk) 


-1/2. 


(k)  ‘ 


We  list  some  numerical  values  for  mk  and  l-mk 

in  Table  4.1.  Incidentally,  as  k  =°,  mk  -*•  1  and 
2  -1 

1-m.  ^  (2k)  .  These  last  two  facts  may  be  derived  by 

K  1 


/2  , v”lv  ^ 


observing  that  L (k  /  (k  x 
and  applying  the  6-method. 


(k) 


1))  -*■  N  (0 , 2 )  as  k  -*■  00 


The  point  of  the  preceding  paragraph  was  to  provide 

a  benchmark  for  determining  when  e^  is  big.  The  best 

one  can  hope  to  do  is  to  have  an  average  error  e^ 

-1/2 

approximately  equal  to  mkn,  if  there  are  n^  observa¬ 

tions  of  the  major  population  (as  long  as  the  minor 
population  isn't  too  close  to  the  major  one).  For 

example,  if  n-,  =  100  and  k  =  2,  then 

1 

_  L  /  ^ 

mkn  '  =0.09.  Similarly  we  will  expect  the  estimate 

of  the  standard  error  to  be  approximately 

e  . 
i 
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(0.215  x  io'4)  /2  =  5  x  10“3  if  s  =  100  and  if  Pi 
is  such  that  about  100  observations  were  used  in  computing 
the  ith  estimate  u ^  (and  if  not  too  many  outliers 
are  included) . 

Now  we  are  ready  to  report  some  results.  We  shall 
describe  and  interpret  5  simulations,  to  be  referred  to 
as  simulations  A,  B,  C,  D,  E,  whose  results  are  recorded 
in  Tables  4.2-4. 6  at  the  end  of  this  paper.  In  simulations 
A,  B,  and  C  we  will  set  k  =  2 ,  ^  =  100,  n.2  -  10,  n^  =  0, 
and  s  =  100  (=  number  of  samples) .  These  three  simulations 
differ  only  in  regard  to  the  sequence  of  proportions  used. 

In  A,  p]_  =  p2  =  p3  =  p4  =  °*5'  in  B'  Pi  =  °-5'  p2  =  °-6' 

P^  =  0.7,  P4  =  0.8;  in  C,  P-^  *  ?2  -  P3  =  P4  *  0.8. 

Simulations  A,  B,  and  C  are  intended  to  illustrate 
the  effect  (on  IET  estimates)  of  the  position  of  the 
subpopulation  when  there  are  no  outliers  (n^  =  0) .  Un¬ 
fortunately,  the  standard  errors,  in  parentheses,  are 
often  comparable  in  size  to  the  differences  between 
various  e^'s.  (To  reduce  the  standard  errors  to  about 
0.001  would  require  at  least  25  times  the  number  of  samples 
used  here:  2500  samples  of  110  observations!)  Nevertheless 
we  may  derive  some  general  conclusions  of  interest. 

In  simulation  A,  the  presence  of  a  subpopulation 
causes  one  to  do  somewhat  worse  than  one  would  otherwise 
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do.  Simulations  B  and  C  suggest  that  the  estimation  problem 
is  hardest  when  the  subpopulation  is  1  or  2  standard 
deviations  away.  The  use  of  an  increasing  sequence  of 
proportions  seems  to  slow  down  convergence  (as  compared 
to  using  the  biggest  p  repeatedly) ,  but  not  to  affect 
e^  very  much. 

In  simulation  D  we  study  the  effects  of  having 
both  outliers  and  a  small  subpopulation  in  the  sample. 

To  simplify  the  presentation,  each  run  in  Table  4.5 

will  be  identified  by  a  parameter  vector: 

2 

(u2,  a  ,  P^,  P2,  P3,  p4)*  For  a 11  of  the  runs,  n1  =  100, 
n2  =  n3  =  10,  s  =  100,  and  k  ~  2. 

The  final  average  error,  e4 ,  in  runs  2  through  7 
is  close  to  0.1  and  is  based  on  0.8(120)  =  96  observations. 

So  we  are  doing  about  as  well  as  we  could  hope  to  do.  The 
presence  of  outliers,  whether  with  variance  100  or  400, 
has  little  effect;  ultimately  IET  screens  out  most  of 
these  points.  Similarly,  the  final  error  is  not  influenced 
much  by  where,  exactly,  the  subpopulation  is  (as  long  as 
it  is  at  least  3  standard  deviations  away)  nor  by  whether 
or  not  an  increasing  sequence  of  proportions  is  used.  In 
practice,  naturally,  we  do  not  know  that  the  right  proportion 
is,  for  example,  0.8.  This  is  one  reason  why  an  increasing 
sequence  might  be  useful  -  one  imagines  gradually  raising  P 
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until  IET  begins  wandering  off,  a  sign  that  P  has  become 
too  big. 

Our  last  simulation,  simulation  E  in  Table  4.6,  involves 

four  dimensional  data.  Each  run  is  again  identified  by 

a  parameter  vector  (u2'  n2'  n3> >  but  in  all  cases 
2 

n^  =  100,  a  =  400  if  n^  >  0,  and  the  sequence  of  P’s 
is  0.5,  0.6,  0.7,  0.8.  The  results  here  are  similar  to 
those  that  have  gone  before,  but  note  that  the  final 
average  error  in  run  #4  is  0.147,  a  rather  large  value, 
presumably  a  result  of  the  fact  that  it  is  based  on 
0.8(130)  =  104  observations,  some  of  which  must,  of 
necessity,  be  outliers  or  belong  to  the  subpopulation. 

These  Monte  Carlo  results,  though  they  are  neither 
terribly  comprehensive  nor  terribly  accurate  (in  the  sense 
of  having  small  standard  errors)  do  reassure  us  that  IET 
behaves  basically  in  the  way  that  intuition  would  expect 
it  to.  We  have  found  that  IET  is  insensitive  to  outliers 
and  small  subpopulations  that  are  sufficiently  far  away, 
and  that  it  does  hone  in  on  the  major  cluster  we  are 
seeking.  These  results  are  not  directly  related  to  the 
theoretical  result  in  chapter  3  concerning  the  distribution 
of  the  stationary  point,  since  we  did  not  let  IET  iterate 
to  a  limit  but  rather  had  it  iterate  a  predecided  number 
of  times.  It  would  be  interesting,  however,  to  carry  out 


J 


if 
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another  Monte  Carlo  study  in  order  to  get  an  idea  of  what, 
on  the  one  hand,  the  small  sample  distribution  of  the 
stationary  point  looks  like  for  spherically  symmetric  ob¬ 
servations  and  how,  on  the  other,  it  depends  on  the  presence 
of  various  kinds  of  contamination. 

The  reader  may  wonder  why  we  chose  to  report  "absolute" 
errors  rather  than  squared  errors.  The  reason  is  that  when 
we  began  doing  simulations,  we  found  that  rather  large 
values  of  would  occur  with  rather  surprising  fre¬ 

quency.  We  were  afraid  that  these  occasional  large  values, 
if  squared,  would  have  an  undue  amount  of  influence.  Re¬ 
porting  absolute  errors  was  a  way  of  downweighting  that 
influence . 


•  A. 
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Chapter  5  -  Scale  Estimation 

As  we  pointed  out  in  chapter  1,  the  estimate  of 
covariance  yielded  by  IET,  the  final  7,  is  biased  towards 
0.  It  turns  out,  in  fact,  that  7  is  an  estimate  of  c7 
where  c  is  an  unknown  constant  between  0  and  1.  In 
essence,  then,  IET  provides  an  estimate  of  both  the  shape 
(the  relative  dimensions  of  the  axes)  and  the  orientation 
but  not  of  the  scale  of  the  ellipsoid  corresponding  to  7- 
It  is  for  this  reason  that  we  devote  this  chapter  to  scale 
estimation.  First  we  shall  describe  the  basic  device  with 
which  we  intend  to  obtain  an  unbiased  estimate  of  l.  Then 
we  shall  discuss  some  properties  of  this  method  and 
investigate  its  efficiency.  In  addition,  a  few  alternative 
scale  estimators  will  be  considered  and  compared. 

Let  us  suppose  that  the  conjecture  at  the  end  of 

Chapter  3  is  correct:  namely,  that  the  stopping  point  of 

~^/2 

IET  is  guaranteed  to  be  within  o  (n  /  )  of  the  stationary 

point  whose  distribution  we  have  derived.  Then  it  follows 

that  the  limiting  l  produced  by  IET  will,  as  n  -  *>, 

converge  to  the  population  covariance  of  the  truncated 

ellipsoidal  distribution  given  in  equation  (A1.22), 

2  2  2  -1 

c(k,p)7,  where  c(k,p)  =  F,^+2(a  )  /F^  ( a  )  and  a  =  (p)  . 

It  would  then  be  plausible  to  obtain  an  estimate  of  7  by 
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dividing  X  by  c(k,p) 

(5.1)  =  c(k,p)_1X 

(The  "U"  in  X^  is  supposed  to  indicate  that  it  is 
an  asymptotically  unbiased  estimate.) 

The  parameter  p,  which  appears  in  c(k,p),  represents 
the  proportion  of  points,  from  the  cluster  we  have  converged 
to  included  in  successive  IET  estimates,  and  not  the 
proportion  of  points  from  the  entire  sample,  which  was  the 
meaning  of  p  in  Chapter  2.  In  Chapter  3,  the  two 
proportions  coincided  because  there  was  only  one  cluster, 
but  typically  in  our  applications  there  will  always  be 
several  clusters.  So  it  is  necessary  to  estimate  p  in 
order  to  compute  X^j* 

One  might  object  that  if  it  is  necessary  to  "know" 
approximately  how  many  observations  are  in  the  cluster 
before  one  can  scale  up  X  to  get  an  unbiased  estimate, 
then  there  is  little  point  to  this  estimation  procedure, 
for  one  could  simply  use  all  of  the  points  thought  to 
belong  to  the  cluster  and  thereby  obtain  an  unbiased 
estimate  directly.  Though  there  is  some  justice  to  this 
criticism,  it  must  be  remarked  that  the  latter  estimate 
would  surely  be  much  less  robust,  as  the  observations 


with  the  largest  influence  on  it  are  precisely  those 

points  about  whose  correct  classification  we  are  least 

certain.  Earlier  we  suggested  one  way  to  approximate 

the  number  of  observations  in  a  cluster:  after  reaching 

a  stopping  point,  gradually  increase  the  number  of  points 

included  in  the  ellipsoid  until  IET  begins  to  move  away. 

Here  the  basic  idea  is  to  base  t  only  on  points  whose 

membership  in  the  cluster  is  reasonably  certain;  then, 

obtain  the  total  number  of  points  in  the  cluster  by 

throwing  in  points  which  are  likely  to  belong  but  about 

whose  classification  we  nevertheless  entertain  some  doubt. 

Another  possibility,  which  really  embodies  the  same  idea 

2 

is  to  plot  a  histogram  of  the  D^'s.  If  we  are  lucky,  there 
will  be  a  relatively  clean  cutoff  as  there  is  in  Figure  5.1. 
Unfortunately  such  cutoffs  are  rather  uncommon. 

One  consolation  is  that  it  is  easy  to  compute  the 
bias  introduced  by  using  the  wrong  p  in  the  equation  for 
7.7.  More  precisely,  if  one  uses  p'  instead  of  p,  then 

U 

asymptotically ,  7y  will  be  off  by  the  factor 
c (k, p ' ) /c (k , p) .  Indeed,  for  many  ellipsoidal  distributions 
of  interest,  c(k,p)  is  increasing  in  p  (in  particular 
for  the  normal  distribution:  see  Lemma  A2.1  and  equation 
(A2.9)).  Hence  if  one  felt  that  p  lay  in  a  certain 
interval,  then  c(k,p)  would  lie  in  a  corresponding 
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interval  and  one  could  study  the  variation  in  with  p. 

Later  in  this  chapter  we  will  introduce  an  alternative 
estimator,  called  v^ ,  which  does  not  require  an  estimate 
of  p.  Unfortunately  we  shall  find  that  it  is  much  less 
efficient  than  a  close  relative  of  £  called  v^,  at 
least  when  the  data  are  normally  distributed. 

But  before  going  on  to  develop  our  ideas  further, 

perhaps  it  will  be  helpful  to  examine  some  numerical  values 

of  c(k,p)  (for  the  normal  case),  which  may  be  found  in 

Table  5.1.  When  k  =  1  and  we  include  25%  of  the  ob- 
~2  2 

servations,  a  is  only  3.3%  of  a  ,  on  the  average,  but 

if  k  =  7  and  25%  of  the  data  is  included,  X  is  42.5% 

of  what  it  should  be,  on  the  average.  Further  values  of 

c(k,p)  may  be  obtained  from  Table  A2.1,  after  making  use 

-1  1/2 

of  the  result  in  (A2.9),  that  c(k,p)  =  1  -  b^((F^(p))  ). 

Note  that  (A2.9)  is  not  true  for  all  ellipsoidal  families; 
it  does  hold,  however,  for  the  normal  one.  It  is  a 
remarkable  coincidence  that  the  bias  reducing  factor  in 
chapter  3,  b,  (a)  ,  is  related  in  this  way  to  the  covariance 

JS» 

of  the  truncated  normal. 

Lemma  A2.2  implies  that 


(5.2)  c ( k , p )  :  1 


1  /2 

(2/k)  ' 


?  (  5 


-1 

~ 


(p)  ) 


for  large  k 


4. 
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and 

,  ~  r  (  (k/2)+l)  2/k  2/k  - 

(5.3)  c  (k , p)  ~  -—£j2)  +1 - p  for  sma11  P 

Hence,  as  k  •+  °°,  c(k,p)  -*■  1  as  long  as  p  >  0.  Further¬ 
more,  c  ( 1 ,  p )  =  0(p2),  c  (2  ,p)  =  0(p),  and  c(7,p)  =  0(p2//7) 
as  p  -*■  0,  which  is  to  say  that  c(k,p)  -*-0  as  p  -*•  0 
more  and  more  slowly  as  k  increases.  This  fact  accounts 
for  the  results  in  the  above  discussion  when  p  =  0.25. 

We  would  like  to  investigate  the  efficiency  of  the 
estimator  Xy-  The  most  satisfying  approach  to  this 
question  would  be  to  define  a  loss  function,  L (X,Xy),  and 
compute  its  expected  value  as  n  -*•  30  when  the  data  belongs 
to  some  family  of  ellipsoidal  distributions.  Then  one 
would  compare  the  asymptotic  risk  to  the  corresponding 
value  for  the  optimal  procedure.  Unfortunately  this 
program  is  a  difficult  one  to  carry  out.  Therefore,  we 
shall  make  a  variety  of  simplifying  assumptions  so  as  to 
make  the  problem  more  tractable.  Later,  we  will  return 
to  comment  upon  how  much  we  have  lost  in  making  these 
assumptions . 

Henceforth  we  shall  assume  that  there  are  n  independent 
observations:  X^,...,Xn  v  N(1,vB),  where  B  is  a  known 

k  <  k  matrix  such  that  B |  =  1 .  We  shall  be  interested 


in  estimating  v,  which  we  shall  sometimes  refer  to  as  the 
scale  of  the  distribution.  Note  that  in  this  formulation 


there  is  no  contamination  and,  as  a  result,  no  ambiguity 
associated  with  "p".  It  is  convenient  to  introduce  some 


-1. 


further  notation  at  this  point.  Let  =  X^B  X^ 


and 


2  2 
note  tnat  Z i  n,  vy  ^  ;  then  define  =  1(Z^  £  t  ),  where 

1(E)  for  an  event  E  is  the  indicator  function  of  that 

event,  and  observe  that  is  a  Bernoulli  random  variable 

2  2-1 

with  parameter  p  =  F^(t  /v)  .  (As  before,  a  =  F^  (p) . ) 

We  will  often  refer  to  the  statistics  J  =  n  ^ZJ., 

JZ  =  n  1ZJiZi,  and  Z  =  n  ^ZZ^.  It  will  be  helpful  to 

write  p  =  J,  since  J  is  an  estimate  of  p,  and 

~2  -1  -  —  -1  * 
a  =  (p)  .  Finally  we  introduce  S  =  (nJ)  ZJLX^X^. 

Now  we  may  present  our  first  estimator  of  scale,  v. : 


(5.4) 


kc(k,p) 


The  first  natural  question  to  ask  is:  why  is  this  a 
plausible  estimate  of  v?  The  numerator  of  (5.4)  is 


essentially  v  multiplied  by  the  sample  mean  of  np 

2  2 

observations  of  a  y ,  truncated  at  a  .  On  the  other 

(k) 

2 

hand,  the  denominator  is  the  expectation  of  a  Xn 


~  2 


(k) 


truncated  at  a  .  This  latter  observation  follows  from 


(Al.7),  (A1.17),  (A2.3)  and  (Al.23).  Of  course  we  would 


prefer  to  use  the  expectation  of 


( Jc ) 


truncated  at  a 


2  2 

but  since  a  =  t  /v  depends  on  v,  we  use  the  estimate 
~  2 

a  instead.  Nevertheless,  v^  is  a  consistent  estimator 
of  v,  a  fact  which  can  easily  be  shown  by  applying  the 
weak  law  of  large  numbers  to  J  and  JZ  and  using  the 
continuity  of  and  b^. 


Since  is  computed  by  truncating  at  a  random 

point  and  we  are  interested  in  studying  simple  estimators 
that  are  similar  to  it,  one  might  well  ask  why  we  truncated 
at  a  fixed  point  in  computing  v^.  Indeed,  we  next  intro- 

-  I  ~ 

duce  v^,  an  estimator  which  is  identical  to  v^  except 
that  the  truncation  point  is  now  random  and  such  that 
exactly  [np]  observations  are  small  enough  to  be  included 
It  will  turn  out  however  (see  Theorem  5.6)  that  v^  and  v 
are  asymptotically  equivalent;  as  a  result  we  can  study 
whichever  one  is  more  convenient  and  that  one  is  v^. 

Before  giving  the  formula  for  v1 ,  it  will  be  convenient 
to  set  -  l(Zi  <  Z([np])).  Then, 


(5.5) 


J' Z/J'  _  i  i 

V1  kc(k?p)  [np]kc(k,p) 


Formula  (5.5)  is  simply  (5.4)  with  replaced  by 
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(with  the  change  that  since  p  is  now  "fixed"  by  the 
2  -1 

statistician,  a  =  (p)  xs  now  known) .  The  duality 

associated  with  either  fixing  t  to  get  v^,  or 

-  I 

fixing  p  to  get  appears  to  be  very  general.  For 

example,  the  "t-fixed"  estimators  v2 >  v^,  v4  to  be 
studied  shortly  all  have  "p-fixed"  counterparts 

^  ~  t  ~  y 

v2 ,  vy  v4.  We  believe  that  the  proof  of  Theorem  5.6, 

^  ^  t 

which  asserts  the  asymptotic  equivalence  of  v-^  and  v^ 

^  v  I 

as  well  as  of  v2  and  v2 ,  will  provide  some  insight  into 
this  phenomenon.  However,  we  conjecture  that  there  is  a 
theorem,  or  perhaps  a  metatheorem,  which  is  yet  to  be 
formulated  precisely,  that  would  make  clear  the  general 
nature  of  this  duality.  Such  a  theorem,  we  believe, 
would  for  example  obviate  the  need  for  working  with  order 
statistics  and  their  asymptotic  distributions  in  many  cases, 
as  in  the  derivation  of  the  asymptotic  distribution  of  the 
median  or  the  trimmed  mean. 

By  now  the  reader  must  be  wondering  why  we  are  so 

~ '  _ 

interested  in  v^  and  v^.  Let  S'  =  (nJ')  IJ\X^X^  and 

t 

observe  that  another  formula  for  v^  is 

~ '  =  k~1tr  ( S ' B~ 1 ) 

V1  c(k,p) 

Now  suppose  that  we  were  to  regard  u  and 


.to. 
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~  ~  i/k 

%/ | Z |  (obtained  from  IET)  as  "correct".  Then,  we 

could  subtract  off  u  from  each  of  the  n  observations, 

so  that  their  population  mean  would  be  zero,  and  set 

B  =  Z/|?|1/Ak.  It  would  follow  that,  since  S'  =  %, 

~  •  -1  ~  1/k 

we  can  write  v  =  c(k,p)  |7|  .  But  then 


Essentially  we  have  factored  Z^  into  two  parts  and  have 
chosen  to  study  the  variability  of  only  the  first,  or 
scale,  part. 

Before  going  on  to  compute  the  asymptotic  distribution 

of  v^,  we  shall  introduce  several  alternative  scale 

estimators.  It  is  appropriate  to  consider  first  the  MLE 

of  v  based  on  all  of  the  data  (i.e.  X,,...,X  ),  which 

1  n 

we  shall  call  Vg.  Using  the  fact  that  the  log-likelihood 

of  one  observation  is  l (v)  =  C  -  (k/2)log  v  -  (2v)  ^x'B  ^x, 

a  Simple  calculation  shows  that  the  Fisher  information 

2 

associated  with  Vg  is  Ig(v)  =  k/2v  .  It  follows  that 
(5.6)  L  (n1/2 (v0-v) )  -  N ( 0 ,  VQ)  as  n  -  » 


whe  re 
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(5.7)  VQ  =  2v2/k. 

The  formula  for  Iq(v)  is  c5u^te  reasonable  as  it  asserts 
that  the  information  in  k  1-dimensional  observations  is 
the  same  as  that  in  one  k-dimensional  observation. 

The  next  estimator,  V2,  we  will  refer  to  as  a 
"Bernoulli-type"  estimator  because  it  only  depends  on 
the  J^’s: 

(5.8)  v2  =  t2/F~1(p) . 

Actually,  v2  is  a  maximum  likelihood  estimator,  as  we  shall 
see  later.  The  "p-fixed"  version  of  v2  is 

(5-  9)  v2  =  Z ( [np] )/Fk  (p) ' 

We  discussed  earlier  how  in  any  application  of  IET, 

we  would  not  know  the  proportion  of  points,  p,  from  a  given 
cluster  included  in  the  final  ellipsoid.  The  next  estimator, 
v ^ ,  offers  the  possibility  of  surmounting  this  problem  in  a 
formal  way  (as  opposed  to  the  ad-hoc  ways  we  mentioned 
before) .  We  define  v^  to  be  the  MLE  of  v  using  the 
truncated  normal  density,  which  is 
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|=  ( 2nv)  _  /2{Fk(t2/v)  )-1exp  (-  -Uc'B^x) 
(5.10)  pt(x;v)|  for  x'B_1xlt2 

j=  0  otherwise 

based  of  course  on  the  M  observations  such  that 

=  1.  This  estimator  does  not  use  any  information  con¬ 
cerning  how  many  observations  had  =  0.  Of  course, 

-w  i 

v,  is  the  MLE  of  v  based  on  the  m  smallest  Z.'s, 

3  i 

where  now  m  is  not  random.  We  conjecture  but  shall  not 

-  I 

prove  that  the  limiting  distribution  of  v^  is  the  same 
as  that  of  v^.  Incidentally,  Cohen  extensively  investi¬ 
gated  estimation  problems  involving  the  truncated  normal 
distribution  in  the  1950's,  but  never,  as  far  as  we  can 
tell,  studied  the  ellipsoidally  truncated  normal.  See  for 
example  [4]  . 

Finally,  v4  is  the  MLE  for  v  based  on  observations 
of  the  censored  normal  distribution ,  which  is  to  say  that 
we  "see"  if  Jx  =  1  but  learn  only  that  =  0 

otherwise.  Therefore,  this  estimator  is  based  on  knowing 
how  many  observations  are  "missing",  in  contrast  to  v^ , 
which  was  based  only  on  the  ' s  with  =  1.  Actually, 

v_j ,  then,  makes  use  of  exactly  the  same  information  as 
did  v^  :  J  and  JZ ;  and  it  is  therefore  particularly 


3- 


1 

i 


appropriate  to  compare  then.  Again  we  conjecture  but  shall 
not  prove  that  v^  has  the  same  asymptotic  distribution 
as  v^,  which  is  the  MLE  of  v  based  on  observing  the  [np] 
smallest  Z^'s  and  knowing  that  there  were  n  observations 
altogether.  Incidentally,  the  likelihood  for  one  observation 
corresponding  to  v^  is 


(5.11) 


( 2ttv)  ~  //2exp  (-  j  — -)  if  x'B“iX  <_  t2 
J.  -  Fk(t2/v)  if  X’b‘1X  >  t2. 


Now  we  are  ready  to  derive  the  asymptotic  distributions 
of  v^  to  v4- 

Theorem  5.1:  As  n  -»•  ”, 


(5.12) 


L(n1/2(v]_  -  v)  ) 


N(0,Vx) 


where 


(5.13)  VL  - 


2  7  4 

- - ^[Fk(a“)  (1-F  (aZ)  )  ^ 

FC.ofa")  k“ 


k+2 


-2(1-E'k,a2|IFXt2«a2|X‘i¥lr-s.4la'>-Fk  +  2(a2>l 
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Proof:  v^  is  a  function  only  of  the  statistics  (J,JZ) . 

We  will  derive  the  joint  asymptotic  distribution  of  these 
two  random  quantities  and  obtain  the  asymptotic  distribution 
of  v^  by  applying  the  delta  method.  By  the  central  limit 
theorem. 


L  (n1/2 ( (J,JZ) 


(U1,U2) ' ) )  -  N(0,C) 


where 


'11 


'12 


C  = 


'12 


'22 


and  (u1,U2)’»  and  C  are,  respectively,  the  mean  and  co- 

2 

variance  of  (J.,JZ.)'.  Since  J-  is  Bernoulli,  y.  =  F.  (a  ), 
111  1  1 

2 

and  by  (A1.8),  u ^  =  vkFk+2 (a  ).  The  variance  of  is 

c11  =  F^(a2)(l  -  Fk(a2))  and  by  (A1.8), 
c12  =  vk(l  -  F^ (a2) ) F^+2 (a2) .  Finally,  by  (Al.8)  and 
(A1.9),  c 2 2  =  v2 [k (k+2) F^+4 (a2 )  -  k2F2  +  2{a")].  Using 
(A1.23),  we  may  rewrite  (5.4)  as 


V1  = 


JZ 


kF^2,FkL(J>> 


If  we  define  the  function 


h(u,,u-)  - - ; -  , 

kFk+2,Fk  <ul>> 

then  =  h(J,JZ)  and  v  =  h(u1,y2).  Applying  the  so 

called  delta  method  we  have  the  result  that  as  n  -*■  00 , 
L(/n(v1~v))  -*■  N ( 0 , )  where 

(5.14)  Vl=(g)'c(g) 


and  the  derivative  above  is  evaluated  at  u^  =  (u^/V^)* 
It  is  easy  to  compute  that 


3h 


ou1 


u 


fk+2(a  1 
Fk+2(a2|£k(a2> 


and 


3h  !  =  1 

3U2\t0>  kFk*2(a2) 


Then,  using  the  fact  that 
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a  straightforward  evaluation  of  (5.14)  reveals  that  (5.13) 
holds.  I 

Before  deriving  the  asymptotic  distributions  of  the 
remaining  estimators,  it  will  be  convenient  to  record  a 
simple  result  concerning  the  effect  of  a  reparameterization 
on  the  Fisher  information. 

Lemma  5.1:  Let  1-^(9)  and  I2(<?>)  be  the  Fisher  informations 
associated  with  f(x;0)  and  f(x?g(<J>)),  respectively.  If 
9  and  $  are  scalars  and  g  is  a  differentiable  strictly 
monotone  transformation,  then 


(5.15)  i2 (<D)  =  (g'  (<t>) )  ^(gU) 


Proof:  The  proof  is  a  simple  calculation.  M 


Next  we  derive  the  asymptotic  distribution  of  v^ 


Theorem  5.2-.  As _ n  -*•  », 


(5.16)  L (n1/2 (v-  -  v) )  -  N  ( 0  ,  V- 


where 


V2  .  V 


2  fkU2) 


4.2,  2. 
a  fk(a  ) 


(5.17) 
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Proof :  Note  that  ,  as  defined  in  (5.9) ,  is  the  MLE  of 

v  when  we  observe  J.  , . .  .  ,  J  ,  which  are  i.i.d.  Bernoulli 

l  n 

2 

random  variables  with  parameter  p  =  Fk(t  /v)  .  Using 

the  notation  of  Lemma  5.1,  I^(p)  =  [p(l— p) ]  ^ .  Further- 

dp  -1  2.  .  2, 

more,  since  ^v  =  '  v  a  f^(a  )  , 


I2(v)  =  ^-(a2fk(a2))2[Fk(a2)  (1-Fk(a2))]-1 


which 


is  equivalent  to  (5.17)  since  V2  =  I2 (v) 


-1 


If  there  are  m  observations  of  X  such  that 

X'B  ^X  £  t2,  then  using  (5.10),  the  log-likelihood 

~^/2 

expressed  as  a  function  of  a  =  tv  '  is  £(a)  where 


(5.18)  m~ 1  £ ( a )  =  C  +  k  log  a  -  |(tr  -(^B  1} ) a2  -  log  Fk(a2) 


d£ 

Then,  the  likelihood  equation,  -r—  =0,  is 

da 


(5.19)  k-1  Lp-ila2  +  bk(a)  =  1. 


If  we  let  be  the  solution  to  (5.19)  and  observe  that 

2  ~  2 

by  definition {  =  t  /a^,  then  we  may  write 


k~1tr  (SB~L)  =  k"1tr  (SB  L) 
1  -  bk(a3)  c  ( k  , F k  f a 2  )  ) 
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>  i  tinmam 


-nfeiitui 


an  equation  which  strongly  resembles  (5.5).  The  next 
theorem  shows  that  =  00  with  non-zero  probability  for 

a  finite  sample  size. 


Theorem  5.3: 


(5.20) 


(5.21) 


„  k  M>  ~ 

2  —  k+2  3 

t£ -(SB-1)  <  _k_  =  >  ~  =  tf_ 

k+2  3  -2  ' 

^  a3 


where  a,  is  the  unique  finite  positive  solution  to  the 


likelihood  equation  (5.19). 


To  prove  this  theorem  we  shall  need  a  lemma,  which 
is  proved  in  Appendix  3. 


Lemma  5.2:  The  function  s (a)  =  ka  (1  -  b^(a))  is 
monotone  decreasing  for  a  >  0.  As  a  0 ,  s(a)  -»■  k/(k+2) 
and  as  a  -►  * ,  s  (a)  0 . 


Proof  of  Theorem  5.3:  We  may  expand  F,,  (a  )  in  powers 

1  ■  in  -  ■ 

of  a  as  in  (A2.12)  to  obtain 


V*2)  ■  2rkX(1  -  +  x'CT1  *  o(a4»> 


Then , 
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log  Fk(a‘) 


a2  k 

const.  +  k  log  a  -  —  (j^J 


2  (k+4) (k+2 ) 2 


+  o  (a 


and  as  a  result  of  (5.18) 


(5.22)  m~ 1  £  ( a )  =  const.  +  ^-(£2  - 


a4 

T 


(k+4) (k+2) 


1 


+  o  (a  )  . 


Now  let  c  t  tr  (SB  and  observe  that 

(am)'1  ||  =  ka~2 (1  -  bk(a))  -  c2  =  s(a)  -  c2. 


If 


2  2 
c  k/k+2,  then  by  Lemma  5.2,  s(a)  -  c  <  0  Va  >  0  and 

therefore  ||  <0  V  a  >  0.  Hence,  the  maximum  value  of 


£  (a) 


is  achieved  at  a  =  0.  On  the  other  hand,  if 


2  2 
c  <  k/k+2 ,  then  there  is  a  unique  a  such  that  s  (a  )  =  c 

c  c 

by  Lemma  5.2.  Furthermore,  ac  is  the  unique  a  >  0  such 


dZ 

da 


that  —  =  0.  But,  by  (5.22),  1(a)  is  increasing  for  a 


near  0;  since  1(a)  -  -  *>  as  a  -*■  »,  there  is  at  least 

one  local  maximum  at  an  a  >  0.  We  may  now  conclude  that 

the  global  maximum  occurs  at  a  .  Of  course,  a,  =  a  .  I 

C  j  c 

-2  -1 

Why  is  it  that  when  t  tr  (SB  )  >_  k/k+2, 
v,  =  Suppose  that  X  has  a  uniform  distribution  in 
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the  ellipsoid  lx  :  x'B  <_  t2}  and  let  Y  =  B  ^2X. 

Then  Y'Y  =  X'B  ^X  and  Y  has  a  uniform  distribution 

in  S t  ( 0 )  ,  the  sphere  of  radius  t.  By  (A1.5),  the  volume 

k  k  /? 

of  this  sphere  is  d^t  ,  where  dk  =  tt  /r((k/2)+l), 

k-1 

and  its  surface  area  is  kdkt  •  Hence 

E(Y'Y)  =  (dktk)_1  /  r2kdkrk_1^r= (k/(k+2) )t2 

-2  -1 

and  t  E ( X ' B  X)  =  k/k+2 .  Therefore,  when 
-2  -1 

t  tr  (SB  )  >_  k/k+2,  the  sample  looks,  at  best,  as  if 
it  is  from  a  uniform  distribution.  Of  course,  as  v  -►  °°, 
the  truncated  normal  approaches  the  uniform  distribution. 

Before  proceeding  to  the  computation  of  the  asymptotic 

distribution  of  v^,  we  investigate  numerically  the 

dependence  of  v^  on  S  in  the  one  dimensional  case. 

2 

We  set  B  =  1  and  s  =  S  and  note  that  v^  =  00  when 
2  2 

s  >_  t  /3.  Table  5.2  contains  some  numerical  values  of 

2  2  ~  2 
s  /t  and  the  corresponding  v^/s  ,  the  factor  by  which 

2 

we  must  multiply  s  to  get  our  estimate. 

Theorem  5.4:  As  n  -*■  =° 


(5.23) 


L  (n1//2  (v3~v)  ) 


N(0,v3) 


where 
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(5.24)  V3 


^-(Fk(a2)  [2-bk(a)  (a2+2-k+kbk  (a)  )  ]  } 


Proof:  The  derivative  of  the  log  of  the  density  in  (5.10), 
written  as  a  function  of  a,  is 


d  log  *t  _  k 
da  a 


(a) 


and  the  second  derivative  is 


d  log  <t>. 


x,B_1x 


da 


T 


k  .  , 

“Tbk 

a 


a)  -  ^bk(a)  (1  -  V  bk(a)) 


Using  the  fact  that 


E( 


■%(!  -  bk(a)), 
a 


where  the  expectation  is  taken  with  respect  to  <j> ^ ,  the 

density  of  the  truncated  distribution,  and  recalling  (Al.7), 
2  2 

(A2.7),  and  t  =  a  v,  we  find  that  the  Fisher  information 
for  the  parameter  a  is 


k  k2  a2 

(5.25)  Ix(a)  =  2-^-(l  -  bk(a)  )  +  ^jb^a)  (1  -  —  -  bfc(a)  )  . 

a  a 


=  -  itv-  /2 
2 


Since 


da 

dv 


,  it  follows  by  Lemma  5.1  that  the 


information  associated  with  v  is 


2 

(5.26)  I,  (v)  =  (a)  . 

*  4v 


As  m  -*•  00 , 

(5.27)  L  (m1//2  (v3~v)  )  -►  N(0,I2(v)  1)  • 

But  as  n  -  =»,  m/n  -  p  *  Fk{a2).  Therefore  applying  (5.25), 
(5.26),  and  (5.27),  we  have  the  result  in  (5.23)  with 
V3  =  (Fk(a2)I2(v))'1.  * 

Theorem  5.5:  As  n  ~»  °° 

(5.28)  L  (n1/2  (v4  -  v)  )  -*N(0,V4) 


where 


(5.29)  v"1  «  V"1  +  V*1. 

Proof :  The  proof  is  a  calculation  similar  to  those  done 

in  the  proofs  of  Theorems  5.2  and  5.4.  • 

The  fact  that  V^1  =  V^1  +  V*1  in  the  preceding 

of  a  very  general  theorem  suggested 


theorem  is  an  instance 
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by  P.  K.  Bhattacharya.  We  will  not  give  a  formal  proof 
but  will  simply  outline  the  argument  for  a  simple  version 
of  it.  Suppose  X  has  a  density  fg (x)  and  suppose 
further  that  the  space  in  which  X  takes  values ,  X/  is 
partitioned  into  two  parts:  A  and  A;  so  X  =  A  U  A. 

Let  P Q  =  /  f„(x)dx  and  define  three  new  random  variables, 

9  A  9 

2  truncated  versions  of  X  and  an  indicator  based  on  X: 

(5.30)  XL  =  X-l [X  e  A] , 

(5.31)  X2  =  X-1[X  e  A] , 

and 

(5.32)  X3  =  1(X  e  A] . 

Then  X^  has  density 

f .(1)  (X)  =  p"Xf  (x)  l[x  s  A] 

-)  y  g 

and  X2  has  density 

fg2)  (x)  =  (l-?0) -1f0  (x)  1  [x  E  Aj  . 
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Of  course  is  a  Bernoulli  random  variable: 

Pr  (X3  =  1)  =  PQ.  Now  let  I]_(9),  I2(6)f  I3  ( 0 )  be  the 
Fisher  informations  associated  with  X^ ,  X2  ,  X3  and  1(9) 
the  information  associated  with  X.  Then  if  we  suppress 
some  algebra  and  assume  that  all  necessary  formal  manipu¬ 
lations  are  valid,  we  may  compute 


(5.33) 


3^(9) 


(  ,  d  .  f9, 2  f9  . 

/  l^log  £-}  p-dx 

A  9  9 


f  [^log  f0]2f0dx  " 


'  2 

(PJ 


(5.34)  I2(6)  *  / 

A  8  9 

'  2 

d  2  (pfl) 

=  /  [Jrlog  f0]  fpdx - — 

A  (1-P9) 


(5.35)  I3(9)  p  (1-P  : 

-2  1 


where  (5.35)  follows  from  Lemma  5.1.  But  using  (5.33), 
(5.34),  and  (5.35),  we  find  that 


(5.361 


1(9)  =  PaI,  (9)  +  (l-PJI,(e)  +  1.(9) 
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The  factors  Pg  and  (l-P^)  appear  in  (5.36)  because 
X-^  and  X2  are  only  non-zero  with  probabilities  P0  and 

1-P8  ’ 

What  we  have  shown  is  that  the  information  in  X  can 
be  partitioned  into  three  easily  interpreted  parts: 
information  from  2  complementary  truncated  random  variables 
and  information  from  a  Bernoulli  random  variable  which 
says  which  truncated  variable  is  observed.  No  doubt  a 
theorem  stating  that  such  a  partitioning  is  possible  can 
be  proved  in  considerable  generality. 

We  have  now  studied  five  estimators  of  v:  Vq  through  v4 
and  have  obtained  their  asymptotic  variances :  VQ  through  V4 
given  in  (5.7),  (5.13),  (5.17),  (5.24),  and  (5.29).  We 
define  the  efficiency  of  v^  to  be  Ei  =  Vvi  £or 
1  <_  i  £  4 .  Of  course  by  Theorem  5.5,  E4  =  We 

also  expect  that  E4  >_  ,  since  v4  and  v^  use  the  same 

data  and  v4  is  the  MLE .  In  Table  5.2  we  give  numerical 
values  for  E^  through  E4  for  k  =  1,...,7  and 
p  =  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  0.95, 

0.99,  0.999.  Note  that  all  of  the  efficiencies  depend 

-1/2  2 
only  on  a  =  tv  or,  alternatively,  on  p  =  Fk(a  ). 

Several  interesting  remarks  may  be  made  about  the 
results  in  Table  5.3.  Note  first  that  v^  is  extremely 


inefficient:  for  example,  when  k  =  1  and  p  =0.7, 

E3  =0.032.  This  means  that  if  one  bases  v^  on  lOOr 
observations,  one  will  do  about  as  well  as  one  would  do 
using  Vq  when  there  were  3r  observations.  It  is  impor¬ 
tant  to  remember,  of  course,  that  only  about  7 Or  obser¬ 
vations  will  be  used  in  computing  v^;  the  others  will 
be  truncated.  Still,  however,  the  result  is  striking. 

A  crucial  point  is  that  using  v^  is  the  best  we  can  do 
if  we  decide  that  we  cannot  guess  the  proportion  of  points 
that  are  not  truncated  (included  in  the  estimate) . 

When  k  =  1  and  p  =  0.7,  E2  =  0.556;  hence,  not 
knowing  any  of  the  actual  values  of  the  observations 
results  in  a  loss  of  only  about  50%  of  the  information. 

This  remark  is  less  surprising  when  one  considers  the 

-  I 

similarity  between  the  mode  of  estimation  used  in  v 2  and, 
for  instance,  the  use  of  the  median  to  estimate  the  mean 
of  a  normal  (see  equation  (5.9)).  Our  next  theorem  will 
demonstrate  that  v2  and  v2  are  asymptotically  equivalent. 
Finally,  we  observe  that,  in  general,  E^/E4  is  about  0.8; 
hence,  the  loss  of  information  due  to  not  using  the  MLE  is 
not  severe.  It  seems  appropriate  to  conclude  that  is 

U 

likely  to  be  a  reasonably  efficient  estimate. 


Theorem  5.6:  As  n 


oo 
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(5.37) 


(n1/2 ( v^  -  v) ) 


N(0,V1) 


and 

(5.38)  L(n1/2  (v'2- v))  -N(0,V2). 


Proof ;  We  will  prove  the  theorem  by  demonstrating  that 
(5.39)  V1  =  V1  +  °p(n  </2) 


and 

(5.40)  v'2  =  v2  +  o  (n“  /2)  . 


First  we  derive  (5.40).  We  will  write  F. 

Kn 

empirical  distribution  corresponding  to  F^. 
possible  to  rewrite  (5.8)  and  (5.9)  as 

^(p) 

(5.41)  V,  =  V  t  ; 

Fk  (P) 


for  the 
It  is 


and 


v„  = 


Fk1<E>» 


(5.42) 


v 
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_^/2 

By  a  Taylor  expansion,  since  p-p  =  0p(n  ) 


(5.43)  F'^p)  -  P-^P)  +  ~  P  V  '  +  °p(n 


-1/2, 


fk(Fk  (p)) 


and  using  the  Bahadur  representation  [2], 


(5.44)  r->)  -  F-1  CP)  4-  -  *  V" 

'k  Fk 


P  ...P- 


. -1/2 1 


knVC"  "k  f.(F."(p)) 


Hence,  by  (5.43) 


Fk1(p) 

(5.45)  =  1  - 

Fk1(P) 


-Zl -  _1  —  +  °p(n  /2) 

Fk  (p)fk{Fk  (p}) 


p  -  P- 


and  from  (5.44)  we  derive 


(5.46)  Fkn(p) 

Fk1(p> 


=  1  - 


P-P- 


-  +  op(n  /2) 


But  (5.45)  and  (5.46)  imply  (5.40).  The  argument  for 
(5.39)  is  very  similar.  First  observe  that  we  may  rewrite 
(5.4)  and  (5.5)  as  follows: 


15.47) 


v,  *  v 


F’hp) 

r/ 

.10 


xdF 


xn 


1  _E£_ 


i.  f'1  (P)  j  [np] 


xdF, 


.Oll _ 
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(5.48) 


Fk1<p) 


Using  a  Taylor  expansion  we  may  write 


Fki(P) 

(5.49)  /  xdF 

0 


Fk1(p)  1 

/  xdF.  +  FjT1  (P)  (P-P)  +  (n"~  /2), 

0  XX  p 


and  by  the  uniformity  .lemma  of  chapter  3  (Lemma  3.3),  since 
(5.44)  holds  and  p-p  =  0  (n  ^2) , 

p 


(5.50)  / 

0 


pki(p> 


Fkl<p»  pki(p> 

/  XdFkn  *  /  XdF, 


Fk1(P)  1 

-  /  xdF  +  o  (n~  /2) 

0  x  p 


rk1(p) 


+  F^1(p)fk(F'1(p))  (F~^(p)-Fk1(p)  ) 


+  o  ( n  '  2 1 
P 


***** 
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Substituting  (5.44)  in  (5.50)  we  obtain 


kn 


Fk1(p' 


,-l 


(5.51)  /  xdFRn  =  /  xdFkn  +  FkX(p)(p"p)  +  °p(n 


V2, 


.-1 


If  we  divide  f  xdF,  by  (5.49)  and  expand,  then  we  find 
;  kn 


Fk1(p)  Fk1(p) 

l  xdFkn  l  xdFkn  F"1(p)(p-p)  1, 

(5.52)  - -S-tt - (1--~I - >+0p(n  > 

’  i,~'  ”  Fk  (P)  y 


Fk*(p)  F“1(p) 


/  / 

0  Kn  0 


XdF, 


/  xdF; 


F-hp) 


-  1 


V(p) 


-1/2 

xdF,  =  /  xdF,  +  0  (n  /  )  ,  we  may 

0  kn  0  k  P 


Since  / 

0 

write  (5.52),  using  (5.47),  as 


Fk1(p) 

l  xdFkn  F“1(p)(p-p)  .l/2 

(5.53)  v,/v  =  - - - - - — -  +  °~(n  7  ) 


'U  v  ~  - =T 

Fk  (p) 


F^IP) 


/  *dFk  l 


xdF, 


f'1  (p) 


3ut  using  (5.43),  and  dividing  (5.51)  by  j  xdFk' 


we  may  conclude 


(5.54)  v^/v  = 


F’hp) 

,  xdFkn  F^tpHo-p)  _i  /-) 

- - S - Y - *  o  (n  /l)  , 

F  X (p)  f7 1  ( p )  P 


/  xdF  k  / 


xdF, 


I 

\ 

f 

I. 
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which  demonstrates  (5.39).  ® 

All  of  the  analysis  we  have  done  in  this  chapter  has 
been  based  on  the  assumption  that  B  is  known  and  that  the 
mean  of  the  normal  distribution  we  are  working  with  is  0,  or, 
equivalently,  that  it  is  known,  say  equal  to  y .  What 
relevance  do  our  results  have  when  y,  B  are  themselves 
unknown,  as  is  the  case  in  most  applications? 

It  is  interesting  that  the  information  matrix 
I(v,y,B)  has  a  block  diagonal  structure,  that  is, 

/xl(v) 

(5.55)  I ( v , y , B)  =  I 

\  0  1 2  ( U  ,  3 )  / 

when  the  density  or  X  is 

(5.56)  i(x;v,y,B)  =  (2tv)  '/2exp  (-  -^(x-y)'B  1(x-u)). 

To  see  that  this  assertion  is  true,  we  first  reparameterize 
d  by  a  differentiable  transformation  (u,B)  =  g ( 5 ) ,  where 
5  ranges  over  some  open  subset  of  a  Euclidean  space. 

(Recall  that  B  is  subject  to  the  restriction  B  =1.) 


'hen  we  may  rewrite  (5.56)  as 
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(5.57)  log  <(>  (x;  v,  0 )  =  -  |  log  (2irv)  -  j^h(x,9) 
where  h  is  a  differentiable  function.  By  differentiating 

ji 

(5.57)  with  respect  to  0  and  noting  that  E ( jg  log  <j>)  *  0 , 

£ 

we  conclude  that  E(yg-  h(X#0))  =  0.  But  that  implies  that 
2 

E(g|gQ  4>)  =  E((2v2)-1  Jg  h(X,9) )  =  0,  which  means  that  (5.55) 

is  true.  Of  course,  the  consequence  of  the  fact  that  the 

information  matrix  is  block  diagonal  is  that  the  asymptotic 

distribution  of  the  MLE  of  v,  v,  is  the  same  whether  (y,B) 

is  known  or  is  simultaneously  estimated  with  v.  So,  for 

~  2 

instance,  the  asymptotic  variance  of  vQ ,  VQ  =  2v  /k, 
given  in  equation  (5.7)  is  also  the  asymptotic  variance 
of  the  MLE  of  v  when  y  and  B  are  estimated  at  the  same 
time . 

The  information  matrix  will  also  be  of  the  form 
shown  in  (5.55)  if  we  observe  the  truncated  or  censored 
version  of  (5.56),  the  densities  for  which  are  given  in 
equations  (5.10)  and  (5.11)  (though  in  these  formulas  x 
must  be  replaced  by  x-y) .  But  the  block  diagonal  structure 
is  dependent  on  having  the  truncation  or  censoring  done 
with  respect  to  the  correct  values  of  y  and  B! 


Certainly,  in  our  applications  the  truncation  or  censoring 
is  done  in  a  data  dependent  way  and  therefore  does  not 
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satisfy  this  requirement. 

In  practice,  we  expect  that  the  estimates  of  y  and 
B  to  be  used  in  the  various  scale  estimates  will  be  de¬ 
rived  from  ellipsoidal  trimming.  Conceivably,  then,  one 
could  derive  the  asymptotic  distributions  of  v2 ,  v^ ,  v^ 

(when  y  and  B  are  estimated)  using  the  kno%m  asymptotic 
distribution  of  the  stationary  point  of  the  IET  algorithm. 

Such  a  line  of  analysis  appears  to  present  considerable 
difficulties,  although  perhaps  they  are  not  insurmountable. 

The  asymptotic  distribution  of  (see  (5.1))  follows 

as  a  direct  consequence  of  the  distribution  of  the 

-  «  ~  .  T/V 

stationary  point?  since  v^  =  \Z^\  ,  its  distribution 

may  be  computed  directly. 

We  shall  not  attempt  to  carry  out  any  of  the  above 
program  here  and  it  is  true,  as  a  result,  that  our  Knowledge 
concerning  the  various  estimates  of  scale  remains  fundamen tally 
incomplete.  It  is  our  hope  nevertheless  that  the  various 
numerical  and  analytical  results  concerning  V^,...,V4  do 
provide  a  rough  picture  of  these  estimators.  For  instance, 
they  provide,  at  the  minimum,  lower  bounds  to  the  true 
squared  error. 


Conclusion 


In  this  thesis  we  have  studied  the  iterative  ellipsoidal 
trimming  algorithm  from  a  number  of  different  points  of 
view.  Based  on  our  experience  with  IET  as  a  data  analytic 
tool  and  the  plausibility  arguments  in  chapter  2  concerning 
"its  behavior,  we  would  conclude  that  it  can  serve  an 
important  role  as  a  clustering  algorithm  when  the  clusters 
being  sought  are  approximately  ellipsoidal.  It  will  be 
especially  useful  when  the  statistician  wishes  to  simulta¬ 
neously  find  a  cluster  and  estimate  its  location  and  shape 
(perhaps  as  a  prelude  to  searching  for  smaller  hidden 
clusters  in  its  tails) . 

In  chapters  3  and  4  we  obtained  analytical  and 
numerical  results  concerning  the  performance  of  IET  in 
its  capacity  as  an  estimation  tool.  Certainly  the  most 
pressing  work  still  to  be  done  in  that  area  is  the  proof 
of  the  conjecture  we  stated  at  the  end  of  chapter  3  con¬ 
cerning  the  distribution  of  the  stopping  point  of  IET. 

The  fifth  chapter  dealt  with  the  problem  of  estimating 


the  scale  of  a  cluster  after  one  has  already  obtained 
estimates  of  its  location  and  shape.  The  status  of  the 
results  presented  there  is  somewhat  unsatisfactory,  as 
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Appendix  1 

Spherically  Symmetric  Distributions 

In  this  appendix  we  collect  a  variety  of  results 
concerning  spherically  symmetric  distributions.  Suppose 
that,  as  in  chapter  3,  g(t)  is  a  nonnegative  function 
defined  on  the  nonnegative  reals  whose  behavior  as  t  -*■  °° 
and  degree  of  differentiability  are  suitable  for  our  pur¬ 
poses,  where  "suitable"  means  that  all  of  the  formal 
manipulations  involving  g  that  we  perform  are,  in  fact, 
valid.  We  generate  a  spherically  symmetric  density  for 
each  dimension  k  ^  1,  f(x)  =  ck  g(|x[  ),  where 

2  ^2  2 
1 x |  *  l  xf.  The  density  of  the  generalized 

i=l  1  w 

.  2 

variable,  T  =  |X|  ,  where  X  'v  f(x),  may  easily  be  derived 
using  (A1.6)  and  equals 


(Al.l) 


fk(t) 


Trk/2t(k/2)-lq(t) 

T(k/2)ck 


and  its  cdf  is  Fk(t). 

From  each  f(x),  in  turn,  we  may  generate  a  multi¬ 
variate  location-scale  family,  f(y?  y,  A),  where 
f(y;  u,  A)  is  the  density  of 

1/2 

Y  *  u  +  A  /‘iX, 
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and  A  is  symmetric  and  positive  definite.  It  is 
reasonable  to  refer  to  these  distributions  as  ellipsoidal 
distributions.  Incidentally,  while  the  expectation  of  Y 
is  y,  the  covariance  of  Y  is  not,  in  general,  A.  Since 
the  covariance  matrix  for  X  is  just  (2irck)  1c^+2I^ 

(a  fact  which  follows  from  equation  (Al.7)  below),  the 
covariance  of  Y  is  %  -  (2T:ck)  1ck+2A* 

-t/2 

If  g(t)  =  e  '  ,  then  the  preceding  construction 

leads  to  the  multivariate  normal  family  of  distributions, 
k/2 

with  ck  ■  (2ir)  .  Another,  more  general,  form  for  g 

that  is  a  valuable  source  of  examples  is 


(A1.2) 


g (t)  =  exp (-rt  ) . 


In  this  case,  ck  *  Trk//2r  k</2ss  (k/2)  1T  (k/2s)  . 

The  bias  reducing  function  introduced  in  Chapter  3, 


bk(a)  = 


cklg(a2) 

Fk(a2)/Vk(a) 


,  can  be  easily  reexpressed  in  terms 


of  fk  and  Fk  as 


(A1.3) 


or  as 


bk(a) 


2a2f k (a2 ) 


kFk (a  ) 


I 


■dii 


(A1.4) 


bk(a) 


t(27rCk}  <W 


AT4 


by  making  use  of  the  formula  for  the  volume  of  a 
k  dimensional  sphere  of  radius  a, 


k/2 


(A1.5) 


Vk(a)  *  r  (  (k/2) +1) 


given,  for  instance,  in  Apostol  [1,  p.  411].  It  will  be 
useful  to  record,  in  addition,  that  the  surface  area  of 
such  a  sphere  is 


2'rr^^2  k-1 

(A1.6)  \(a)  "  r(k/2)a 


The  next  lemma  expresses  a  variety  of  integrals  in 

2 

terms  of  the  generalized  x  density  and  distribution 
function.  We  will  set  S  =  S_(0),  the  ball  of  radius  a 

fl 

centered  at  the  origin. 

Lemma  Al . 1 : 

(A1.7)  / (x' x) c.^g (x1 x) dx  =  kM. 

S  * 

(A1.8)  /  x2c^g  (x' x)  dx  * 

s 

(A1.9)  /(x'x)  2c)c1g(x'x)dx  =  k(k+2)M2 


(A1.10) 

/  x^c^gtx'jOdx  =  3M2 

s 

(Al.ll) 

/  x^x2ck1g(x'x) dx  =  M2 

s 

(A1.12) 

/ (x,x)cr1g' (x*x)dx  =  kM, 
s  K  3 

(A1.13) 

/  xlck1(3,  (x'x)dx  =  M3 

(A1.14) 

/ (x'x) 2c~1g' (x'x) dx  =  k(k+2)M. 

S  K  4 

(A1.15) 

/  x^c^g'  (x'x)dx  =  3M4 

s 

(A1.16) 

/  xix2c^1g ' (x'x)dx  *  M4 
s 

where 

(A1.17) 

%  - 

(A1.18) 

M2  =  (2’')'2<=i1ck+4Pk+4(a2) 

(A1.19) 

Mj  =  (2nok)  c]t+2fjc+2(a  * 

Fk(a2)/2 

(A1.20) 

M4  =  (2^)-2C-1oJC+4f]c+4(a2)  - 

( 2TTCk)  ck+2Fk+2 

Proof :  We  may  demonstrate  (Al.7)  by  putting  r  =  x'x, 

integrating  with  respect  to  Aj,(r)dr,  and  then  setting 

t  =  r  .  Of  course  (Al.8)  is  a  trivial  consequence  of 

(Al.7).  To  derive  (Al.9),  use  the  same  substitutions  as 

1/2 

for  (Al.7).  To  obtain  (A1.10),  let  r  =  (x'x)  '  ,  and 
y^  *  xi  for  1  <_  i  <_  k-1;  then  the  integral  may  be  written 


2/  dr  c"1g(r2)r  /  _  2  (r2 

0  E  y  <r 

i<k  1_ 


l  y I) 

i<k  1 


2,-/2  4 


Y^Y- 


But  it  is  easy  to  show  that 


,  ,  2  2  r  2-l/2J 
!  (r  -y,"  E  y ■)  dy2 

2  2  2  i 
E  y“<  r' ~Y\  l<i<k 

l<i<k 


dY*_' 


=  (r2-y2) (k"3)/2  -1 
Yl‘  T( 


(k-l)/2 


r ( (k-i)/2) 


from  which  it  follows  that 


„  2  2 

^  yjlr 

i<k 


(r2-  E  y2)-  /2vjdy 


k (k+2 ) r (k/2 ) 


Then,  finally,  the  integral  in  (Al.10)  is  straightforward 


to  obtain.  Since  (x'x)  is  the  sum  of  k  terms  of  the 

4  2  2 

form  x^,  and  k(k-l)  terms  of  the  form  x^x^  (i^j), 

by  symmetry  we  conclude  that 


/(x'x)2 f(x)dx  =  k  /  x^f(x)dx  +  k(k-l)  /  x^x?f(x)dx 
S  S  S 


which  implies  (Al.ll).  The  five  integrals,  (Al.12)  -  (A1.16) 
are  entirely  analogous  and  are  obtained  in  a  similar  way, 
the  only  difference  being  that  an  integration  by  parts  is 
necessary  to  get  rid  of  the  g‘  (hence,  the  presence  of 
2  terms  in  the  expressions  for  and  M4) .  1 

It  follows  from  equation  (Al.8)  that  the  covariance 
of  the  truncated  spherically  symmetric  distribution  is  just 


2  -1 

(A1.21)  Cov  (X |  X '  X  <  a  )  =  (2irck)  c 


k+2 


Fk+2la2> 

Fk(a2) 


Similarly,  the  covariance  of  the  truncated  ellipsoidal 
distribution  is 


..-1 


(A1.22)  Cov(Y|  (Y-u)  'A  "(Y-u)<a“)  ■  (2ttc,  )  ~1c.  ?  7 - A. 

k  Fk(a2) 


Fk+2(a‘ > 


Recalling  that  the  covariance  of  Y  is  %  =  (2‘n'c^)  ^Cj^A, 
and  defining 

EW5'k1(p>» 


(A 1.23) 


c(k,p) 
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2 

where  p  =  F,  (a  ) ,  we  may  finally  write  the  covariance  of 
the  truncated  distribution  as  simply  c(k,p)£. 
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Our  two  intentions  in  this  appendix  are  to  specialize 
the  results  of  appendix  1  to  the  normal  distribution  and 
to  give  some  detailed  analytical  and  numerical  information 
about  the  function  b^(a)  in  this  case. 

k/2 

First  note  that  since  ck  =  (2tt)  when 

X  v  N(0,IjJ,  as  we  shall  always  assume  in  this  appendix, 

-1  -2  -1 

it  follows  that  (2ttc^,'  cjc+2  =  ^  anc^  (2^)  ck+4  = 

which  results  in  several  simplifications  in  the  formulas 
of  appendix  1.  For  example,  (Al.4)  becomes 


(A2.1) 


bk(a}  = 


2<Wa2> 

Fk(a2) 


A  well  known  property  of  the  chi-squared  distribution  is 
that 


(A2.2)  2fkJ.2(t)  ■  Fk(t)  -  Fk+2(t), 

a  fact  which  may  be  verified  by  differentiating  both  sides 
of  the  equation  (and  which  is  not  true  in  the  general 
spherically  symmetric  case).  By  making  use  of  (A2.2),  we 
find  that  equations  (Al.17)  -  (A1.20)  may  be  greatly 
simplified.  Now, 
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(A2.3)  M]_  =  Fk+2(a2) 

(A2.4)  M 2  =  Fk+4(a2) 

(A2.5)  M3  =  -  7Fk+2(a2) 

(A2.6)  M4  =  -  |Fk+4(a2)- 

Actually,  (A2.5)  and  (A2.6)  follow  immediately  from  (A2.3) 
and  (A2.4),  since  g'(t)  =  -  ^g(t)  for  the  normal  case. 
Another  important  fact  is  that  as  a  result  of  (A2.1)  and 
(A2.2)  , 

(A2.7)  bk(a)  *  1  -  Fk*2<.  * . 

V*  > 

But  as  a  consequence  of  (A2.7)  we  find  that  the  covariance 
of  the  truncated  multivariate  normal  (the  specialization 
of  (A1 . 22 ) )  is: 

(A2.8)  Cov(Y| (Y-u) '%~l (Y-u)  <  a2)  =  (l-bk(a))Z. 


Another  way  of  saying  the  same  thing  is  that  for  a  normal 
distribution,  by  (Al.23), 
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(A2.9)  c(k,p)  =  1  -  bk ( (F”1 (p) )  /2). 

We  collect  a  few  facts  about  bk  in  the  next  lemma. 

Lemma  A2 . 1 ;  The  function  bk(a)  is  strictly  decreasing  on 
[O,00).  As  a  goes  from  0  to  °°,  bk  decreases  from  1 
to  0.  Furthermore, 

(A2.10)  bk(a)  =  (k/a)bk(a)  (1  -  a2/k  -  t>k(a)) 

and 

(A2.ll)  bk(a)  =  1  ”  a2/{k+2)  +  o  (a2)  as  a  -  0. 

Proof :  Let  rk  =  [2k/2T (k/2) ]~l .  Then, 

fk(a2)  =  rkak"2(1  “  a2/2  +  °(a2)) 

and 

(A2.12)  Fk(a2)  =  2k"1rkak(l  -  (k+2)_1ka2/2  +  o(a2)). 
Therefore,  using  (A1.3), 

bk(a)  =  (1  -  a2/2) ( 1  +  (k+2) ‘1ka2/2 )  +  o(a2) 
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which  implies  (A2.ll),  which  in  turn  implies  that  ■*  1 
as  a  +  0,  To  prove  that  is  decreasing  we  shall  show 

that 

(A2.13)  bk<a)  >  1  '  a2/(k+2)  Va  >  0, 

which,  in  conjunction  with  (A2.10)  and  the  fact  that  bk  >  0 , 

implies  the  result.  Of  course  (A2.13)  is  trivially  true 

2  2  2 
when  a  >_  k+2.  So  assume  a  <  k+2 ,  let  t  =  a  ,  and  call 

2tf  (t) 

d(t>  =  k~F~ ' ' ( t )  -  (1  "  t/ (k+2) )  . 
k 

Then  it  will  suffice  to  show  d(t)  >0  for  0  <  t  <  k+2. 

) .  Since  dQ  has  the 
to  prove  that  dQ(t)  >  0, 
0  for  0  <  t  <  k+2 . 

0 


asymptotic  behavior  of  b^. 


Let  dQ(t)  =  d(t)Fk(t)/(l  -  t/ (k+2 ) 

same  sign  as  d,  it  will  be  enough 

\ 

or  since  dQ(0)  =  0,  that  dg(t)  > 
But 


dQ(t)  = 


2fk(t> 


k  { 1-t/ (k+2 ) ) 


(err)  > 


k+2 


for  0  <  t  <  k+2 .  ■ 

* 

Our  next  result  describes  the 


a  .  ill  ■'  1  ^[‘ii  tujf. 
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Lemma  A2 . 2 : 


(A2.14) 

bk((F^(P))1/2)  *  <2A)1/2  *'»p 

Ml  as  k 

(A2.15) 

bk((Fk1(P,)1/2)  = 

.  r((k/2)+l)2/l1  2/k  2/k 

1  (k/2)  +  1  p  r  1 

as  p  -*■  0 

Proof : 

/2k  fk(k  +  a /2k)  -►  <J>  <a)  as  k  00 

00 


and 


Fk  $_1(p)  as  k  -*■  00 . 

/2k 

Hence,  /2k  fk(F~1(p))  -  4>($'1(p))  as  k  -  °°. 
But  then 


(k/2)1/2bv((F71(p))1/2)  =  (2/k)1/2F‘1(p)fk(F^1(p))/p 


k  vt'',ukv  k 

t  1/2  ^2k  fv (F*1  (P) ) 

,  (2/H)1/2(  k  *  <*/2>  V  > - 5JS - 

/2k 

_  as  k  *  . 

P 


which  implies  (A2.14). 
By  (A2.12) , 
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P  -  F.  (a  ) 


<2/k)-i zrr - 

2 W*T  (k/2) 


+  o  (a  )  as  a  ■»  0. 


Hence 


a2  =  [2k/2r((k/2)+l)p]2/k  +  o  (p2/k)  as  p  -►  0. 


Using  (A2.ll)  we  conclude  that  (A2.15)  holds.  B 

We  end  this  appendix  by  presenting  some  numerical 
values  of  b.  in  Table  A2.1. 
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Appendix  3  -  Proof  of  Lemma  5.2 
We  prove  Lemma  5.2. 

_  2 

Lemma  5.2;  The  function  s(a)  =  ka  (1  -  bk(a))  is  mono¬ 
tone  decreasing  for  a  >  0.  As  a  -*•  0,  s(a)  k/(k+2) 
and  as  &■+<*>,  s(a)  -*■  0. 

Proof:  By  (A2.ll)  in  Lemma  A2.1,  bk(a)  =  1  -  a2/(k+2)  +  o(a2). 

Therefore ,  s(a)  -  k/(k+2)  +  o(l)  as  a  -  o.  Since  bR  is 
bounded,  s(a)  -*■  0  as  a  -*•  ».  After  some  algebra  we  find 
that 

s' (a)  *  -ka~3 [2  +  (k-2-a2) b^ (a)  -  kb2(a)]. 


Let 


Q  (x) 


2 


x 


.k-2-a2 
v  k 


) x  -  2/k. 


Then,  s' (a)  <  0  iff  Q(b^(a))  <  0.  The  quadratic  equation 
Q(x)  =  0  always  has  both  a  positive  and  a  negative  solution; 
we  define  h.(a)  to  be  the  positive  solution: 


hk(a) 


Since  as 


a^+2-k  2 

[(--2^  ■  )  +  2/k] 


1/2 


a2+2-k 
2k  ' 


Q  (x) 


+  00 ,  we  may  conclude  that 


^  — -  -t  tiku  m  ■- 


L 
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Q(bk(a))  <  0  iff  bk(a)  <  hk(a),  since  bk(a)  >0  Va  >  0. 

Note  that  hk(0)  *  1  and,  therefore,  hk(0)  “  bk(0)  =  °* 

We  will  show  that  gk(a2)  =  (bk(a)  -  bk(a) ) Fk (a2 ) /hk (a)  is 

positive  for  all  a  >  0  by  showing  that  its  derivative 

2 

is  always  positive.  Let  s  =  a  ,  q  -  (s+2-k)/2k,  and 
2  1/2 

r  *  (q  +  2/k)  .  Then,  after  a  tedious  computation  we 

find  that 


gk(s) 


fk  S  ,  (k+2) 

—  - ?l — 2 — r 

r(r-qr  k 


( (k+2)  ~  s (k-2) ) 

2k3 


It  will  suffice  to  show  that 


k+2  , 


(k+2)2  -  s(k-2) 


But  upon  squaring  and  expanding  both  sides  of  this  inequality, 

and  after  more  tedious  algebraic  manipulations,  we  find 

2  2 

that  it  is  equivalent  to  (k+2)  >  (k-2)  ,  which  is,  of 

course,  true  for  k  >  0.  B 
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Figure  2.1 


S+M.  r-  Yeilv* 


f— 1 — [-  |  immuwii 


j— •  JT***-  n«^ 

i  »  H - 1  'j-j. 


0  +r i*<  f**.*.* 


This  figure  represents  one  iteration  of  IET  in  one 
dimension. 


Figure  2.2 


The  stopping  "window"  of  IET  may  contain  two  clusters 
if  p  is  too  big.  Figures  2.2  and  2.3  illustrate  this 
phenomenon  in  1  and  2  dimensions,  respectively. 
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Table  2.1  - 

Example 

_1 

^2 

|llI 

IL2I 

-2 

2 

8 

18 

-2.17 

2.98 

11 

15 

-1.51 

3.52 

14 

12 

-1.05 

4.25 

18 

8 

-0.53 

5.72 

20 

6 

-0.23 

6.80 

23 

3 

0.17 

10.75 

25 

1 

0.45 

25 

25 

1 

Note:  The  successive  estimates  are 

produced 

by  k-means 

operating  on  8 

N  (-2 »  1)  and 

17  N ( 2 , 

1)  observations 

and  one  outlier 

at  x  =  25. 

Table  2.2  - 

Example 

_2 

U-j_  cr 

2 

1_  ^2 

~  2 
l2_ 

|LXI 

u.2i 

-2  1 

2 

1 

8 

18 

-2.18  0. 

59  2.98 

29.8 

6 

20 

-2.21  0. 

15  2.47 

2  9.3 

3 

23 

-2.41  0. 

013  1.89 

27.8 

2 

24 

-2.49  10 

~4  1.72 

27.3 

0 

26 

Note  ;  The  successive  estimates  are  produced  by 
k-means  2  operating  on  the  same  sample  as  was  used  in 
Example  1. 
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Table  3.1 


£ 

<r1(l-p) 

p"10(4>"1(l-p)) 

00 

o 

1 

o 

• 

00 

0.35 

w 

o 

0 

0.80 

0.2 

0.84 

1.40 

0.1 

1.28 

1.76 

0.001 

3.09 

3.37 

Table 

4.1  -  Values 

»  of 

k 

1-mk 

1 

0.798 

0.363 

2 

0.886 

0.215 

3 

0.921 

0.151 

4 

0.940 

0.116 

Note : 


is  given  in  equation  (4.1). 
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Table  4.2  -  Simulation  A 


eo 

el 

e2 

e3 

e4 

U2  =  0 

0.092 

0.109 

0.124 

0.133 

0.138 

(0.004) 

(0.006) 

(0.007) 

(0.008) 

(0.008) 

y2  =  1 

‘0.104 

0.126 

0.148 

0.161 

0.168 

(0.006) 

(0.006) 

(0.007) 

(0.008) 

(0.008) 

y2  »  2 

0.154 

0.146 

0.155 

0.165 

0.170 

(0.006) 

(0.007) 

(0.007) 

(0.008) 

(0.008) 

u2  =  3 

0.209 

0.152 

0.152 

0.161 

0.170 

(0.006) 

(0.008) 

(0.008) 

(0.009) 

(0.009) 

U2  -  4 

0.267 

0.159 

0.151 

0.155 

0.160 

(0.007) 

(0.007) 

(0.007) 

(0.007) 

(0.007) 

Note : 

In  all  runs, 

k  =  2  ,  n., 

=  100  ,  n.2  = 

=  10  ,  n^  =  0  , 

s  =  100, 

Pi  =  o. 

5  for  1  £  i 

£  4 ,  and 

y2  =  j  is 

to  be  interpreted  as 

^2  ~  (j,0).  Standard  errors  are  in  parenthesis.  In  all 
simulations,  y,  =  (0,0). 
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Table  4.3  -  Simulation  B 


eo 

S1 

S2 

e3 

e4 

U2  -  0 

0.089 

0.103 

0.122 

0.122 

0.114 

(0.004) 

(0.006) 

(0.006) 

(0.006) 

(0.006) 

u2  =  1 

0.096 

0.111 

0.127 

0.131 

0.127 

(0. 005) 

(0.006) 

(0. 006) 

(0.006) 

(0.006) 

U2  -  2 

0.157 

0.142 

0.141 

0.143 

0.134 

(0.006) 

(0.006) 

(0.008) 

(0.007) 

(0.008) 

U2  =  3 

0.206 

0.149 

0.136 

0.133 

0.119 

(0.007) 

(0.008) 

(0.008) 

(0.007) 

(0.006) 

u2  =  4 

0.269 

0.169 

0.142 

0.125 

0.109 

(0.007) 

(0.008) 

(0.008) 

(0.008) 

(0.006) 

Note :  Same  parameters  as  in  simulation  A  except  for 

p^  =  0.5,  P2  =  0.6,  =  0.7,  p^  =  0.8. 


i*tl‘  II  ■  i.iUm 


■  A. 
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Table  4.4  -  Simulation  C 


eo 

el 

e2 

S3 

S4 

U2  =  0 

0.092 

0.100 

0.105 

0.107 

0.107 

(0.004) 

(0.005) 

(0.006) 

(0.006) 

(0.006) 

u2  -  1 

0.107 

0.114 

0.124 

0.128 

0.129 

(0.005) 

(0.006) 

(0.006) 

(0.006) 

(0.006) 

y2  =  2 

0.161 

0.136 

0.136 

0.135 

0.135 

(0.006) 

(0.006) 

(0.007) 

(0.007) 

(0.007) 

U2  -  3 

0.199 

0.116 

0.112 

0.112 

0.111 

(0.006) 

(0  .006) 

(0.006) 

(0.006) 

(0.006) 

U2  =  4 

0.273 

0.117 

0.120 

0.120 

0.120 

(0.008) 

(0.007) 

(0.007) 

(0.007) 

(0.007) 

w2  "  5 

0.329 

0.103 

0.107 

0.111 

0.112 

(0.007) 

(0.006) 

(0.006) 

(0.006) 

(0.006) 

Note : 

Same  parameters  as  in 

simulation  A 

except  for 

ii 

O 

8  for  1  £  i 

i  4. 

ik 


4i»  —  ■  *•-- 


— 

- 1 
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Table  4.5  -  Simulation  D 

< 

Run 

1. 

(3,  100, 

0.5,  0.5,  0.5,  0.5) 

0.302 

0.195  0.175 

0.171 

0.169 

, 

(0.015) 

(0.010)  (0.009) 

(0.008) 

(0.008) 

2. 

(3,  100, 

0.5,  0.6,  0.7,  0.8) 

< 

0.280 

0.186  0.143 

0.121 

0.106 

■j 

i 

:  i 

(0.013) 

(0.010)  (0.008) 

(0.007) 

(0.007) 

? 

3. 

O 

o 

rH 

CO 

-w' 

0.8,  0.8,  0.8,  0.8) 

0.310 

0.126  0.111 

0.110 

0.112 

! 

ig 

(0.015) 

(0.006)  (0.006) 

(0.006) 

(0.006  ) 

\\ 

4 

4. 

(4,  100, 

0.5,  0.6,  0.7,  0.8) 

;i 

0 

| 

'  * 

0.328 

0.202  0.145 

0.111 

0.096 

(0.016) 

(0.010)  (0.007) 

(0.006) 

(0.005) 

i 

5. 

(4,  400, 

0.5,  0.6,  0.7,  0.8) 

1 

0.533 

0.270  0.170 

0.126 

0.099 

I 

f 

(0.026) 

(0.012)  (0.008) 

(0.006) 

(0.005) 

I 

6. 

(4,  400, 

00 

• 

o 

00 

• 

o 

00 

o 

00 

o 

0.511 

0.138  0.109 

0.106 

0.105 

>■ 

(0.027) 

(0.008)  (0.006) 

(0.006) 

(0.006) 

[ 

7. 

(6,  400, 

0.5,  0.6,  0.7,  0.8) 

0.558 

0.265  0.171 

0.131 

0.102 

(0.029) 

(0.012)  (0.008) 

(0.006) 

(0.005) 

Note:  The  format  is 

T 

(u2'  o 

•  P2'  P3'  P4) 

®0 

¥1  *2 

®3 

¥4 

(a  ) 

(a  )  (3_  ) 

(a  ) 

(a  )  . 

e0 

el  S2 

e3 

e4 

For 

all  runs 

r  ®  100 ,  1^2  =  ^3  ~  10/  s 

=  100,  Jc 

=  2. 

A  _ 
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Table  4.6  -  Simulation  E 


Run 


1. 

(3,  10, 

0) 

0.167 

0.139 

0.139 

0.131 

0.121 

(0.005) 

(0.005) 

(0.005) 

(0.005) 

(0.004) 

2. 

o 

r— 1 

VO 

0) 

0.279 

0.129 

0.132 

0.124 

0.115 

(0.005) 

(0.005) 

(0.005) 

(0.005) 

(0.004) 

3. 

(7,  10, 

10) 

0.596 

0.193 

0.137 

0.118 

0.100 

(0.020) 

(0. 008) 

(0. 005) 

(0.004) 

(0.004) 

4. 

(7,  10, 

20) 

0.707 

0.219 

0.136 

0.103 

0.147 

(0.024) 

(0.006) 

(0.005) 

(0.004) 

(0.004) 

Note 

:  The 

format  is  the 

same  as  in 

simulation  D  except  that 

the 

parameter  vector  is 

( U2'  n2'  n3 ) 

.  For  all  runs 

nl  = 

100,  a 

2 

400,  pL  = 

0.5,  p2  =  0 

•  6 ,  P-j  =  0.7,  p^ 

II 

o 

00 

and  k  =  4 . 


Table  5.1 


Values  for  c(k,p)  in  the  normal  case 
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Table  5.2 

Numerical  values  of  v,  when  k  =  1 


2  /4-2 

S  /t 

v3/s2 

0.081 

1.006 

0.108 

1.03 

0.146 

1.10 

0.193 

1.29 

0.214 

1.42 

0.235 

1.66 

0.255 

2.00 

0.274 

2.53 

0.290 

3.44 

0.324 

12.40 
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Tahle  5.3  -  Efficiencies  of  Scale  Estin.atorj_ 


0.012 

0.076 

0.098 

0.113 

0.125 

0.135 

0.143 

0.101 

0.155 

0.191 

0.216 

0.234 

0.248 

0.260 

0.158 

0.238 

0.283 

0.314 

0.335 

0.352 

0.366 

0.228 

0.324 

0.376 

0.409 

0.433 

0.450 

0.465 

0.307 

0.414 

0.469 

0.503 

0.527 

0.545 

0.559 

0.398 

0.510 

0.564 

0.597 

0.620 

0.637 

0.650 


0.055 

0.100 

0.131 

0.153 

0.170 

0.183 

0.194 

0.120 

0.199 

0.246 

0.278 

0.300 

0.316 

0.329 

0.194 

0.297 

0.351 

0.385 

0.407 

0.424 

0.437 

0.277 

0.391 

0.445 

0.477 

0.497 

0.512 

0.523 

0.368 

0.480 

0.528 

0.553 

0.569 

0.580 

0.588 

0.463 

0.560 

0.594 

0.610 

0.620 

0.625 

0.629 


*3 

0.000 

0.000 

0.000 

0.001 

0.001 

0.002 

0.002 

0.000 

0.001 

0.002 

0.004 

0.006 

0.007 

0.009 


0.055 

0.100 

0.132 

0.154 

0.171 

0.185 

0.196 

0.120 

0.200 

0.249 

0.282 

0.305 

0.324 

0.338 


0.000 

0.194 

0.003 

0.300 

0.007 

0.358 

0.011 

0.396 

0.015 

0.422 

0.018 

0.442 

0.021 

0.458 

0.001 

0.278 

0.009 

0.400 

0.017 

0.462 

0.024 

0.501 

0.030 

0.527 

0.035 

0.547 

0.040 

0.562 

0.004 

0.372 

0.020 

0.500 

0.034 

0.561 

0.045 

0.598 

0.054 

0.623 

0.062 

0.642 

0.068 

0.656 

0.012 

0.475 

0.040 

0.600 

0.062 

0.656 

0.079 

0.690 

0.092 

0.712 

0.103 

0.728 

0.111 

0.740 

’  A  .  — 


T 


P  k 

1 

2 

3 

.7  4 

5 

6 
7 

1 

2 

3 

.8  4 

5 

6 


I-  1 

2 

3 

:  .9  4 

5 

6 
7 

3  i 

| 

1  .95  4 

5 

I  6 

1  7 


1 


1 

2 

3 

.  999  4 

5 

6 
7 


ft 
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0.503 

0.556 

0.032 

0.612 

0.621 

0.079 

0.662 

0.637 

0.111 

0.692 

0.641 

0.134 

0.712 

0.642 

0.152 

0.727 

0.642 

0.165 

0.738 

0.640 

0.176 

0.628 

0.632 

0.079 

0.723 

0.648 

0.152 

0.765 

0.639 

0.197 

0.789 

0.630 

0.226 

0.  805 

0.621 

0.248 

0.817 

0.614 

0.265 

0.826 

0.607 

0.278 

0.781 

0.640 

0.207 

0.847 

0.589 

0.311 

0.874 

0.555 

0.365 

0.890 

0.532 

0.400 

0.900 

0.515 

0.424 

0.907 

0.502 

0.442 

0.913 

0.491 

0.457 

0.876 

0.552 

0.368 

0.917 

0.472 

0.478 

0.934 

0.430 

0.531 

0.943 

0.404 

0.563 

0.949 

0.385 

0.585 

0.953 

0.372 

0.602 

0.956 

0.361 

0.615 

0.970 

0.280 

0.703 

0.982 

0.214 

0.776 

0.986 

0.185 

0.807 

0.988 

0.168 

0.826 

0.990 

0.157 

0.838 

0.991 

0.148 

0.847 

0.991 

0.142 

0.854 

0.997 

0.068 

0.930 

0.998 

0.048 

0.951 

0.999 

0.039 

0.960 

0.999 

0.035 

0.965 

0.999 

0.032 

0.968 

0.999 

0.030 

0.970 

0.999 

0.028 

0.972 

0.588 

0.700 

0.748 

0.775 

0.794 

0.807 

0.817 

0.712 

0.800 

0.836 

0.856 

0.869 

0.879 

0.886 

0.847 

0.900 

0.920 

0.932 

0.939 

0.944 

0.948 

0.920 

0.950 

0.961 

0.967 

0.971 

0.973 

0.975 

0.983 

0.990 

0.992 

0.994 

0.995 

0.995 

0.996 

0.998 

0.999 

0.999 

0.999 

0.999 

1.000 

1.000 


A.  - 


Table  A2 

.  1  -  Numerical 

Values  for 

bk 

k 

p 

1 

2 

3 

4 

5 

6 

7 

a 

0.126 

0.459 

0.764 

1.031 

1.269 

1.485 

1.683 

•  1 

bk(a) 

0.995 

0.948 

0.887 

0.831 

0.782 

0.741 

0.705 

a 

0.674 

1.177 

1.538 

1.832 

2.086 

2.313 

2.519 

•  5 

bk(a) 

0.857 

0.693 

0.593 

0.526 

0.477 

0.440 

0.410 

a 

1.282 

1.794 

2.154 

2.447 

2.700 

2.925 

3.131 

•  8 

fck(a) 

0.562 

0.402 

0.326 

0.281 

0.249 

0.226 

0.208 

a 

1.960 

2.448 

2.795 

3.080 

3.327 

3.548 

3.751 

95 

bk(a) 

0.241 

0.158 

0.123 

0.103 

0.090 

0.081 

0.  074 

a 

2.576 

3.035 

3.368 

3.644 

3.884 

4.100 

4.2  98 

99 

bu  (a) 

0.  075 

0.  047 

0.  035 

0.029 

0.025 

0.022 

0.020 
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